In the foundations of modern probability, as laid out by Kolmogorov, the basic objects of study are constructed in the following order:

  1. Firstly, one selects a sample space {\Omega}, whose elements {\omega} represent all the possible states that one’s stochastic system could be in.
  2. Then, one selects a {\sigma}-algebra {{\mathcal B}} of events {E} (modeled by subsets of {\Omega}), and assigns each of these events a probability {{\bf P}(E) \in [0,1]} in a countably additive manner, so that the entire sample space has probability {1}.
  3. Finally, one builds (commutative) algebras of random variables {X} (such as complex-valued random variables, modeled by measurable functions from {\Omega} to {{\bf C}}), and (assuming suitable integrability or moment conditions) one can assign expectations {\mathop{\bf E} X} to each such random variable.

In measure theory, the underlying measure space {\Omega} plays a prominent foundational role, with the measurable sets and measurable functions (the analogues of the events and the random variables) always being viewed as somehow being attached to that space. In probability theory, in contrast, it is the events and their probabilities that are viewed as being fundamental, with the sample space {\Omega} being abstracted away as much as possible, and with the random variables and expectations being viewed as derived concepts. See Notes 0 for further discussion of this philosophy.

However, it is possible to take the abstraction process one step further, viewing the algebra of random variables and their expectations as the foundational concept, and ignoring the presence of the original sample space, the algebra of events, and the probability measure.

There are two reasons for wanting to shed (or abstract away) these previously foundational structures. Firstly, it allows one to more easily take certain types of limits, such as the large {n} limit {n \rightarrow \infty} when considering {n \times n} random matrices, because quantities built from the algebra of random variables and their expectations, such as the normalised moments of random matrices, tend to be quite stable in the large {n} limit (as we have seen in previous notes), even as the sample space and event space vary with {n}. (This theme of using abstraction to facilitate the taking of the large {n} limit also shows up in the application of ergodic theory to combinatorics via the correspondence principle; see this previous blog post for further discussion.)

Secondly, this abstract formalism allows one to generalise the classical, commutative theory of probability to the more general theory of non-commutative probability theory, which does not have a classical underlying sample space or event space, but is instead built upon a (possibly) non-commutative algebra of random variables (or “observables”) and their expectations (or “traces”). This more general formalism not only encompasses classical probability, but also spectral theory (with matrices or operators taking the role of random variables, and the trace taking the role of expectation), random matrix theory (which can be viewed as a natural blend of classical probability and spectral theory), and quantum mechanics (with physical observables taking the role of random variables, and their expected value on a given quantum state being the expectation). It is also part of a more general “non-commutative way of thinking” (of which non-commutative geometry is the most prominent example), in which a space is understood primarily in terms of the ring or algebra of functions (or function-like objects, such as sections of bundles) placed on top of that space, and then the space itself is largely abstracted away in order to allow the algebraic structures to become less commutative. In short, the idea is to make algebra the foundation of the theory, as opposed to other possible choices of foundations such as sets, measures, categories, etc..

[Note that this foundational preference is to some extent a metamathematical one rather than a mathematical one; in many cases it is possible to rewrite the theory in a mathematically equivalent form so that some other mathematical structure becomes designated as the foundational one, much as probability theory can be equivalently formulated as the measure theory of probability measures. However, this does not negate the fact that a different choice of foundations can lead to a different way of thinking about the subject, and thus to a different set of questions and a different set of proofs and solutions. Thus it is often of value to understand multiple foundational perspectives at once, to get a truly stereoscopic view of the subject.]

It turns out that non-commutative probability can be modeled using operator algebras such as {C^*}-algebras, von Neumann algebras, or algebras of bounded operators on a Hilbert space, with the latter being accomplished via the Gelfand-Naimark-Segal construction. We will discuss some of these models here, but just as probability theory seeks to abstract away its measure-theoretic models, the philosophy of non-commutative probability is also to downplay these operator algebraic models once some foundational issues are settled.

When one generalises the set of structures in one’s theory, for instance from the commutative setting to the non-commutative setting, the notion of what it means for a structure to be “universal”, “free”, or “independent” can change. The most familiar example of this comes from group theory. If one restricts attention to the category of abelian groups, then the “freest” object one can generate from two generators {e,f} is the free abelian group of commutative words {e^n f^m} with {n,m \in {\bf Z}}, which is isomorphic to the group {{\bf Z}^2}. If however one generalises to the non-commutative setting of arbitrary groups, then the “freest” object that can now be generated from two generators {e,f} is the free group {{\Bbb F}_2} of non-commutative words {e^{n_1} f^{m_1} \ldots e^{n_k} f^{m_k}} with {n_1,m_1,\ldots,n_k,m_k \in {\bf Z}}, which is a significantly larger extension of the free abelian group {{\bf Z}^2}.

Similarly, when generalising classical probability theory to non-commutative probability theory, the notion of what it means for two or more random variables to be independent changes. In the classical (commutative) setting, two (bounded, real-valued) random variables {X, Y} are independent if one has

\displaystyle \mathop{\bf E} f(X) g(Y) = 0

whenever {f, g: {\bf R} \rightarrow {\bf R}} are well-behaved functions (such as polynomials) such that {\mathop{\bf E} f(X)}, {\mathop{\bf E} g(Y)} both vanish. In the non-commutative setting, one can generalise the above definition to two commuting bounded self-adjoint variables; this concept is useful for instance in quantum probability, which is an abstraction of the theory of observables in quantum mechanics. But for two (bounded, self-adjoint) non-commutative random variables {X, Y}, the notion of classical independence no longer applies. As a substitute, one can instead consider the notion of being freely independent (or free for short), which means that

\displaystyle \mathop{\bf E} f_1(X) g_1(Y) \ldots f_k(X) g_k(Y) = 0

whenever {f_1,g_1,\ldots,f_k,g_k: {\bf R} \rightarrow {\bf R}} are well-behaved functions such that all of {\mathop{\bf E} f_1(X), \mathop{\bf E} g_1(Y), \ldots, \mathop{\bf E} f_k(X), \mathop{\bf E} g_k(Y)} vanish.

The concept of free independence was introduced by Voiculescu, and its study is now known as the subject of free probability. We will not attempt a systematic survey of this subject here; for this, we refer the reader to the surveys of Speicher and of Biane. Instead, we shall just discuss a small number of topics in this area, in order to give the flavour of the subject.

The significance of free probability to random matrix theory lies in the fundamental observation that random matrices which are independent in the classical sense also tend to be independent in the free probability sense, in the large {n} limit {n \rightarrow \infty}. (This is only possible because of the highly non-commutative nature of these matrices; as we shall see, it is not possible for non-trivial commuting independent random variables to be freely independent.) Because of this, many tedious computations in random matrix theory, particularly those of an algebraic or enumerative combinatorial nature, can be done more quickly and systematically by using the framework of free probability, which by design is optimised for algebraic tasks rather than analytical ones.

Much as free groups are in some sense “maximally non-commutative”, freely independent random variables are about as far from being commuting as possible. For instance, if {X, Y} are freely independent and of expectation zero, then {\mathop{\bf E} XYXY} vanishes, but {\mathop{\bf E} XXYY} instead factors as {(\mathop{\bf E} X^2) (\mathop{\bf E} Y^2)}. As a consequence, the behaviour of freely independent random variables can be quite different from the behaviour of their classically independent commuting counterparts. Nevertheless there is a remarkably strong analogy between the two types of independence, in that results which are true in the classically independent case often have an interesting analogue in the freely independent setting. For instance, the central limit theorem (Notes 2) for averages of classically independent random variables, which roughly speaking asserts that such averages become gaussian in the large {n} limit, has an analogue for averages of freely independent variables, the free central limit theorem, which roughly speaking asserts that such averages become semicircular in the large {n} limit. One can then use this theorem to provide yet another proof of Wigner’s semicircle law (Notes 4).
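
The contrast between {\mathop{\bf E} XYXY} and {\mathop{\bf E} XXYY} can already be glimpsed numerically. The following is a rough Python/numpy sketch (the helper names gue and tau are ad hoc, and the matrix size is arbitrary) which tests these two fourth moments on a pair of independent GUE matrices, which, as discussed above, are asymptotically free:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # matrix size; freeness only emerges in the large n limit

def gue(n):
    # a GUE-type matrix, normalised so that tau(X^2) -> 1 as n -> infinity
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (a + a.conj().T) / (2.0 * np.sqrt(n))

def tau(m):
    # normalised trace, playing the role of expectation
    return np.trace(m).real / len(m)

X, Y = gue(n), gue(n)            # independent, mean zero, asymptotically free
print(tau(X @ Y @ X @ Y))        # ~ 0, the freely independent prediction
print(tau(X @ X @ Y @ Y))        # ~ tau(X^2) tau(Y^2)
print(tau(X @ X) * tau(Y @ Y))   # ~ 1, for comparison
```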

Another important (and closely related) analogy is that while the distribution of sums of independent commutative random variables can be quickly computed via the characteristic function (i.e. the Fourier transform of the distribution), the distribution of sums of freely independent non-commutative random variables can be quickly computed using the Stieltjes transform instead (or with closely related objects, such as the {R}-transform of Voiculescu). This is strongly reminiscent of the appearance of the Stieltjes transform in random matrix theory, and indeed we will see many parallels between the use of the Stieltjes transform here and in Notes 4.

As mentioned earlier, free probability is an excellent tool for computing various expressions of interest in random matrix theory, such as asymptotic values of normalised moments in the large {n} limit {n \rightarrow \infty}. Nevertheless, as it only covers the asymptotic regime in which {n} is sent to infinity while holding all other parameters fixed, there are some aspects of random matrix theory to which the tools of free probability are not sufficient by themselves to resolve (although it can be possible to combine free probability theory with other tools to then answer these questions). For instance, questions regarding the rate of convergence of normalised moments as {n \rightarrow \infty} are not directly answered by free probability, though if free probability is combined with tools such as concentration of measure (Notes 1) then such rate information can often be recovered. For similar reasons, free probability lets one understand the behaviour of {k^{th}} moments as {n \rightarrow \infty} for fixed {k}, but has more difficulty dealing with the situation in which {k} is allowed to grow slowly in {n} (e.g. {k = O(\log n)}). Because of this, free probability methods are effective at controlling the bulk of the spectrum of a random matrix, but have more difficulty with the edges of that spectrum (as well as with related concepts such as the operator norm, Notes 3) as well as with fine-scale structure of the spectrum. Finally, free probability methods are most effective when dealing with matrices that are Hermitian with bounded operator norm, largely because the spectral theory of bounded self-adjoint operators in the infinite-dimensional setting of the large {n} limit is non-pathological. (This is ultimately due to the stable nature of eigenvalues in the self-adjoint setting; see this previous blog post for discussion.) For non-self-adjoint operators, free probability needs to be augmented with additional tools, most notably by bounds on least singular values, in order to recover the required stability for the various spectral data of random matrices to behave continuously with respect to the large {n} limit. We will discuss this latter point in a later set of notes.

— 1. Abstract probability theory —

We will now slowly build up the foundations of non-commutative probability theory, which seeks to capture the abstract algebra of random variables and their expectations. The impatient reader who wants to move directly on to free probability theory may largely jump straight to the final definition at the end of this section, but it can be instructive to work with these foundations for a while to gain some intuition on how to handle non-commutative probability spaces.

To motivate the formalism of abstract (non-commutative) probability theory, let us first discuss the three key examples of non-commutative probability spaces, and then abstract away all features that are not shared in common by all three examples.

Example 1: Random scalar variables. We begin with classical probability theory – the study of scalar random variables. In order to use the powerful tools of complex analysis (such as the Stieltjes transform), it is very convenient to allow our random variables to be complex valued. In order to meaningfully take expectations, we would like to require all our random variables to also be absolutely integrable. But this requirement is not sufficient by itself to get good algebraic structure, because the product of two absolutely integrable random variables need not be absolutely integrable. As we want to have as much algebraic structure as possible, we will therefore restrict attention further, to the collection {L^{\infty-}:= \bigcap_{k=1}^\infty L^k(\Omega)} of random variables with all moments finite. This class is closed under multiplication, and all elements in this class have a finite trace (or expectation). One can of course restrict further, to the space {L^\infty= L^\infty(\Omega)} of (essentially) bounded variables, but by doing so one loses important examples of random variables, most notably gaussians, so we will work instead with the space {L^{\infty-}}. (This will cost us some analytic structure – in particular, {L^{\infty-}} will not be a Banach space, in contrast to {L^\infty} – but as our focus is on the algebraic structure, this will be an acceptable price to pay.)

The space {L^{\infty-}} of complex-valued random variables with all moments finite now becomes an algebra over the complex numbers {{\bf C}}; i.e. it is a vector space over {{\bf C}} that is also equipped with a bilinear multiplication operation {\cdot: L^{\infty-} \times L^{\infty-} \rightarrow L^{\infty-}} that obeys the associative and distributive laws. It is also commutative, but we will suppress this property, as it is not shared by the other two examples we will be discussing. The deterministic scalar {1} then plays the role of the multiplicative unit in this algebra.

In addition to the usual algebraic operations, one can also take the complex conjugate or adjoint {X^* = \overline{X}} of a complex-valued random variable {X}. This operation {*: L^{\infty-} \rightarrow L^{\infty-}} interacts well with the other algebraic operations: it is in fact an anti-automorphism on {L^{\infty-}}, which means that it preserves addition {(X+Y)^* = X^*+Y^*}, reverses multiplication {(XY)^* = Y^* X^*}, is anti-homogeneous ({(cX)^* = \overline{c} X^*} for {c \in {\bf C}}), and it is invertible. In fact, it is its own inverse ({(X^*)^* = X}), and is thus an involution.

This package of properties can be summarised succinctly by stating that the space {L^{\infty-}} of complex-valued random variables with all moments finite is a (unital) {*}-algebra.

The expectation operator {\mathop{\bf E}} can now be viewed as a map {\mathop{\bf E}: L^{\infty-} \rightarrow {\bf C}}. It obeys some obvious properties, such as being linear (i.e. {\mathop{\bf E}} is a linear functional on {L^{\infty-}}). In fact it is {*}-linear, which means that it is linear and also that {\mathop{\bf E}(X^*) = \overline{\mathop{\bf E} X}} for all {X}. We also clearly have {\mathop{\bf E} 1 = 1}. We will remark on some additional properties of expectation later.

Example 2: Deterministic matrix variables. A second key example is that of (finite-dimensional) spectral theory – the theory of {n \times n} complex-valued matrices {X \in M_n({\bf C})}. (One can also consider infinite-dimensional spectral theory, of course, but for simplicity we only consider the finite-dimensional case in order to avoid having to deal with technicalities such as unbounded operators.) Like the space {L^{\infty-}} considered in the previous example, {M_n({\bf C})} is a {*}-algebra, where the multiplication operation is of course given by matrix multiplication, the identity is the matrix identity {1 = I_n}, and the involution {X \mapsto X^*} is given by the matrix adjoint operation. On the other hand, as is well-known, this {*}-algebra is not commutative (for {n \geq 2}).

The analogue of the expectation operation here is the normalised trace {\tau(X) := \frac{1}{n} \hbox{tr} X}. Thus {\tau: M_n({\bf C}) \rightarrow {\bf C}} is a {*}-linear functional on {M_n({\bf C})} that maps {1} to {1}. The analogy between expectation and normalised trace is particularly evident when comparing the moment method for scalar random variables (based on computation of the moments {\mathop{\bf E} X^k}) with the moment method in spectral theory (based on a computation of the moments {\tau(X^k)}).

Example 3: Random matrix variables. Random matrix theory combines classical probability theory with finite-dimensional spectral theory, with the random variables of interest now being the random matrices {X \in L^{\infty-} \otimes M_n({\bf C})}, all of whose entries have all moments finite. It is not hard to see that this is also a {*}-algebra with identity {1 = I_n}, which again will be non-commutative for {n \geq 2}. The normalised trace {\tau} here is given by

\displaystyle \tau(X) := \mathop{\bf E} \frac{1}{n} \hbox{tr} X,

thus one takes both the normalised matrix trace and the probabilistic expectation, in order to arrive at a deterministic scalar (i.e. a complex number). As before, we see that {\tau: L^{\infty-} \otimes M_n({\bf C}) \rightarrow {\bf C}} is a {*}-linear functional that maps {1} to {1}. As we saw in Notes 3, the moment method for random matrices is based on a computation of the moments {\tau(X^k) = \mathop{\bf E} \frac{1}{n} \hbox{tr} X^k}.
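
As a concrete illustration (and nothing more), here is a short Python/numpy sketch evaluating the trace in each of the three model examples; the sizes and distributions chosen are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 4, 20000

# Example 1: a scalar random variable; tau is the expectation E
x = rng.standard_normal(trials)
print(x.mean())                                # E X ~ 0

# Example 2: a deterministic matrix; tau is the normalised trace
A = np.diag([1.0, 2.0, 3.0, 4.0])
print(np.trace(A) / n)                         # (1/n) tr A = 2.5

# Example 3: a random matrix; tau = E (1/n) tr, a deterministic scalar
G = rng.standard_normal((trials, n, n))
X = (G + G.transpose(0, 2, 1)) / 2             # random symmetric matrices
print(np.einsum('kii->k', X @ X).mean() / n)   # E (1/n) tr X^2
```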

Let us now simultaneously abstract the above three examples, but reserving the right to impose some additional axioms as needed:

Definition 1 (Non-commutative probability space, preliminary definition) A non-commutative probability space (or more accurately, a potentially non-commutative probability space) {({\mathcal A},\tau)} will consist of a (potentially non-commutative) {*}-algebra {{\mathcal A}} of (potentially non-commutative) random variables (or observables) with identity {1}, together with a trace {\tau: {\mathcal A} \rightarrow {\bf C}}, which is a {*}-linear functional that maps {1} to {1}. This trace will be required to obey a number of additional axioms which we will specify later in this set of notes.

This definition is not yet complete, because we have not fully decided on what axioms to enforce for these spaces, but for now let us just say that the three examples {(L^{\infty-}, \mathop{\bf E})}, {(M_n({\bf C}), \frac{1}{n} \hbox{tr})}, {(L^{\infty-} \otimes M_n({\bf C}), \mathop{\bf E} \frac{1}{n} \hbox{tr})} given above will obey these axioms and serve as model examples of non-commutative probability spaces. We mention that the requirement {\tau(1)=1} can be viewed as an abstraction of Kolmogorov’s axiom that the sample space has probability {1}.

To motivate the remaining axioms, let us try seeing how some basic concepts from the model examples carry over to the abstract setting.

Firstly, we recall that every scalar random variable {X \in L^{\infty-}} has a probability distribution {\mu_X}, which is a probability measure on the complex plane {{\bf C}}; if {X} is self-adjoint (i.e. real valued), so that {X=X^*}, then this distribution is supported on the real line {{\bf R}}. The condition that {X} lie in {L^{\infty-}} ensures that this measure is rapidly decreasing, in the sense that {\int_{\bf C} |z|^k\ d\mu_X(z) <\infty} for all {k}. The measure {\mu_X} is related to the moments {\tau(X^k) = \mathop{\bf E} X^k} by the formula

\displaystyle \tau(X^k) = \int_{\bf C} z^k\ d\mu_X(z) \ \ \ \ \ (1)

 

for {k=0,1,2,\ldots}. In fact, one has the more general formula

\displaystyle \tau(X^k (X^*)^l) = \int_{\bf C} z^k \overline{z}^l\ d\mu_X(z) \ \ \ \ \ (2)

 

for {k,l=0,1,2,\ldots}.

Similarly, every deterministic matrix {X \in M_n({\bf C})} has an empirical spectral distribution {\mu_X = \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i(X)}}, which is a probability measure on the complex plane {{\bf C}}. Again, if {X} is self-adjoint, then this distribution is supported on the real line {{\bf R}}. This measure is related to the moments {\tau(X^k) = \frac{1}{n} \hbox{tr} X^k} by the same formula (1) as in the case of scalar random variables. Because {n} is finite, this measure is finitely supported (and in particular is rapidly decreasing). As for (2), the spectral theorem tells us that this formula holds when {X} is normal (i.e. {XX^*=X^* X}), and in particular if {X} is self-adjoint (of course, in this case (2) collapses to (1)), but is not true in general. Note that this subtlety does not appear in the case of scalar random variables because in this commutative setting, all elements are automatically normal.

Finally, for random matrices {X \in L^{\infty-} \otimes M_n({\bf C})}, we can form the expected empirical spectral distribution {\mu_X = \mathop{\bf E} \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i(X)}}, which is again a rapidly decreasing probability measure on {{\bf C}}, which is supported on {{\bf R}} if {X} is self-adjoint. This measure is again related to the moments {\tau(X^k) = \mathop{\bf E} \frac{1}{n} \hbox{tr} X^k} by the formula (1), and also by (2) if {X} is normal.
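
For self-adjoint matrices, the identity (1) is simply the statement that the normalised trace of {X^k} is the {k^{th}} moment of the eigenvalue distribution; here is a quick Python/numpy sketch of this (purely illustrative, with arbitrary matrix size):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = (a + a.conj().T) / 2                      # a self-adjoint matrix

lam = np.linalg.eigvalsh(X)                   # atoms of the empirical spectral distribution mu_X
for k in range(5):
    lhs = np.trace(np.linalg.matrix_power(X, k)).real / n   # tau(X^k)
    rhs = np.mean(lam ** k)                                 # integral of z^k against mu_X
    print(k, lhs, rhs)                                      # the two sides of (1) agree
```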

Now let us see whether we can set up such a spectral measure {\mu_X} for an element {X} in an abstract non-commutative probability space {({\mathcal A},\tau)}. From the above examples, it is natural to try to define this measure through the formula (1), or equivalently (by linearity) through the formula

\displaystyle \tau( P(X) ) = \int_{\bf C} P(z)\ d\mu_X(z) \ \ \ \ \ (3)

 

whenever {P: {\bf C} \rightarrow {\bf C}} is a polynomial with complex coefficients (note that one can define {P(X)} without difficulty as {{\mathcal A}} is a {*}-algebra). In the normal case, one may hope to work with the more general formula

\displaystyle \tau( P(X,X^*) ) = \int_{\bf C} P(z,\overline{z})\ d\mu_X(z) \ \ \ \ \ (4)

 

whenever {P: {\bf C} \times {\bf C} \rightarrow {\bf C}} is a polynomial of two complex variables (note that {P(X,X^*)} can be defined unambiguously precisely when {X} is normal).

It is tempting to apply the Riesz representation theorem to (3) to define the desired measure {\mu_X}, perhaps after first using the Weierstrass approximation theorem to pass from polynomials to continuous functions. However, there are multiple technical issues with this idea:

  • In order for the polynomials to be dense in the continuous functions in the uniform topology on the support of {\mu_X}, one needs the intended support {\sigma(X)} of {\mu_X} to be on the real line {{\bf R}}, or else one needs to work with the formula (4) rather than (3). Also, one needs the intended support {\sigma(X)} to be bounded for the Weierstrass approximation theorem to apply directly.
  • In order for the Riesz representation theorem to apply, the functional {P \mapsto \tau(P(X,X^*))} (or {P \mapsto \tau(P(X))}) needs to be continuous in the uniform topology, thus one must be able to obtain a bound of the form {|\tau(P(X,X^*))| \leq C \sup_{z \in \sigma(X)} |P(z,\overline{z})|} for some (preferably compact) set {\sigma(X)}. (To get a probability measure, one in fact needs to have {C=1}.)
  • In order to get a probability measure rather than a signed measure, one also needs some non-negativity: {\tau(P(X,X^*))} needs to be non-negative whenever {P(z,\overline{z}) \geq 0} for {z} in the intended support {\sigma(X)}.

To resolve the non-negativity issue, we impose an additional axiom on the non-commutative probability space {({\mathcal A},\tau)}:

  • (Non-negativity) For any {X \in {\mathcal A}}, we have {\tau(X^* X) \geq 0}. (Note that {X^* X} is self-adjoint and so its trace {\tau(X^* X)} is necessarily a real number.)

In the language of von Neumann algebras, this axiom (together with the normalisation {\tau(1)=1}) is essentially asserting that {\tau} is a state. Note that this axiom is obeyed by all three model examples, and is also consistent with (4). It is the non-commutative analogue of the Kolmogorov axiom that all events have non-negative probability.

With this axiom, we can now define a positive semi-definite inner product {\langle, \rangle_{L^2(\tau)}} on {{\mathcal A}} by the formula

\displaystyle \langle X, Y \rangle_{L^2(\tau)} := \tau(X^* Y).

This obeys the usual axioms of an inner product, except that it is only positive semi-definite rather than positive definite. One can impose positive definiteness by adding an axiom that the trace {\tau} is faithful, which means that {\tau(X^* X) = 0} if and only if {X=0}. However, we will not need the faithfulness axiom here.

Without faithfulness, {{\mathcal A}} is a semi-definite inner product space with semi-norm

\displaystyle \| X \|_{L^2(\tau)} := (\langle X, X \rangle_{L^2(\tau)})^{1/2} = \tau(X^* X)^{1/2}.

In particular, we have the Cauchy-Schwarz inequality

\displaystyle |\langle X,Y\rangle_{L^2(\tau)}| \leq \|X\|_{L^2(\tau)} \|Y\|_{L^2(\tau)}.

This leads to an important monotonicity:

Exercise 2 (Monotonicity) Let {X} be a self-adjoint element of a non-commutative probability space {({\mathcal A}, \tau)}. Show that we have the monotonicity relationships

\displaystyle |\tau(X^{2k-1})|^{1/(2k-1)} \leq |\tau(X^{2k})|^{1/(2k)} \leq |\tau(X^{2k+2})|^{1/(2k+2)}

for any {k \geq 1}.
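
In the matrix model, this monotonicity is just the monotonicity of the {L^p} norms of the (uniform) eigenvalue distribution; the following Python/numpy sketch (illustrative only) checks it for a random self-adjoint matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
a = rng.standard_normal((n, n))
X = (a + a.T) / 2                      # a self-adjoint matrix
lam = np.linalg.eigvalsh(X)

def m(k):
    # |tau(X^k)|^{1/k}, computed via the empirical spectral distribution
    return abs(np.mean(lam ** k)) ** (1.0 / k)

for k in range(1, 6):
    print(m(2 * k - 1) <= m(2 * k), m(2 * k) <= m(2 * k + 2))   # both True
```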

As a consequence, we can define the spectral radius {\rho(X)} of a self-adjoint element {X} by the formula

\displaystyle \rho(X) := \lim_{k \rightarrow \infty} |\tau(X^{2k})|^{1/( 2k)}, \ \ \ \ \ (5)

 

in which case we obtain the inequality

\displaystyle |\tau(X^k)| \leq \rho(X)^k \ \ \ \ \ (6)

 

for any {k=0,1,2,\ldots}. We then say that a self-adjoint element is bounded if its spectral radius is finite.

Example 3 In the case of random variables, the spectral radius is the essential supremum {\|X\|_{L^\infty}}, while for deterministic matrices, the spectral radius is the operator norm {\|X\|_{op}}. For random matrices, the spectral radius is the essential supremum {\|\|X\|_{op}\|_{L^\infty}} of the operator norm.
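
For a single self-adjoint matrix, the convergence of (5) to the operator norm can be observed directly; a minimal Python/numpy sketch (with arbitrary matrix size):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.standard_normal((n, n))
X = (a + a.T) / 2                      # a self-adjoint matrix
lam = np.linalg.eigvalsh(X)

for k in (1, 2, 5, 10, 50, 100):
    print(k, np.mean(lam ** (2 * k)) ** (1.0 / (2 * k)))   # increases to rho(X)
print(np.max(np.abs(lam)))             # the operator norm, i.e. rho(X)
```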

Guided by the model examples, we expect that a bounded self-adjoint element {X} should have a spectral measure {\mu_X} supported on the interval {[-\rho(X),\rho(X)]}. But how to show this? It turns out that one can proceed by tapping the power of complex analysis, and introducing the Stieltjes transform

\displaystyle s_X(z) := \tau( (X-z)^{-1} ) \ \ \ \ \ (7)

 

for complex numbers {z}. Now, this transform need not be defined for all {z} at present, because we do not know that {X-z} is invertible in {{\mathcal A}}. However, we can avoid this problem by working formally. Indeed, we have the formal Neumann series expansion

\displaystyle (X-z)^{-1} = - \frac{1}{z} - \frac{X}{z^2} - \frac{X^2}{z^3} - \ldots

which leads to the formal Laurent series expansion

\displaystyle s_X(z) = - \sum_{k=0}^\infty \frac{\tau(X^k)}{z^{k+1}}. \ \ \ \ \ (8)

 

If {X} is bounded self-adjoint, then from (6) we see that this formal series actually converges in the region {|z| > \rho(X)}. We will thus define the Stieltjes transform {s_X(z)} on the region {|z| > \rho(X)} by this series expansion (8), and then extend to as much of the complex plane as we can by analytic continuation. (There could in principle be some topological obstructions to this continuation, but we will soon see that the only place where singularities can occur is on the real interval {[-\rho(X),\rho(X)]}, and so no topological obstructions will appear. One can also work with the original definition (7) of the Stieltjes transform, but this requires imposing some additional analytic axioms on the non-commutative probability space, such as requiring that {{\mathcal A}} be a {C^*}-algebra or a von Neumann algebra, and I wish to avoid discussing these topics here as they are not the main focus of free probability theory.)
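
One can compare the series (8) against the resolvent directly in the matrix model; the following Python/numpy sketch (illustrative only; the evaluation point and truncation are arbitrary) does this at a point with {|z| > \rho(X)}:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
a = rng.standard_normal((n, n))
X = (a + a.T) / 2
lam = np.linalg.eigvalsh(X)
rho = np.max(np.abs(lam))

z = 2.0 * rho                                               # a point with |z| > rho(X)
exact = np.trace(np.linalg.inv(X - z * np.eye(n))).real / n # tau((X-z)^{-1})
series = -sum(np.mean(lam ** k) / z ** (k + 1) for k in range(200))
print(exact, series)                  # the Laurent series (8) converges to the resolvent trace
```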

We now push the domain of definition of {s_X(z)} into the disk {\{ |z| \leq \rho(X)\}}. We need some preliminary lemmas.

Exercise 4 Let {X} be bounded self-adjoint. For any real number {R}, show that {\rho(R^2+X^2) = R^2 + \rho(X)^2}. (Hint: use (5), (6)).

Exercise 5 Let {X} be bounded normal. Show that {|\tau(X^k)| \leq \tau( (X^* X)^k )^{1/2} \leq \rho(X^* X)^{k/2}}.

Now let {R} be a large positive real number. The idea is to rewrite the (formal) Stieltjes transform {\tau((X-z)^{-1})} using the formal identity

\displaystyle (X-z)^{-1} = ((X+iR)-(z+iR))^{-1} \ \ \ \ \ (9)

 

and take Neumann series again to arrive at the formal expansion

\displaystyle s_X(z) = - \sum_{k=0}^\infty \frac{\tau((X+iR)^k)}{(z+iR)^{k+1}}. \ \ \ \ \ (10)

 

From the previous two exercises we see that

\displaystyle |\tau((X+iR)^k)| \leq (R^2+\rho(X)^2)^{k/2}

and so the above Laurent series converges for {|z+iR| > (R^2+\rho(X)^2)^{1/2}}.

Exercise 6 Give a rigorous proof that the two series (8), (10) agree for {z} large enough.

We have thus extended {s_X(z)} analytically to the region {\{ z: |z+iR| > (R^2+\rho(X)^2)^{1/2}\}}. Letting {R \rightarrow \infty}, we obtain an extension of {s_X(z)} to the upper half-plane {\{ z: \hbox{Im}(z) > 0 \}}. A similar argument (shifting by {-iR} instead of {+iR}) gives an extension to the lower half-plane, thus defining {s_X(z)} analytically everywhere except on the interval {[-\rho(X),\rho(X)]}.

On the other hand, it is not possible to analytically extend {s_X(z)} to the region {\{ z: |z| > \rho(X) - \varepsilon \}} for any {0 < \varepsilon < \rho(X)}. Indeed, if this were the case, then from the Cauchy integral formula (applied at infinity), we would have the identity

\displaystyle \tau(X^k) = -\frac{1}{2\pi i} \int_{|z|=R} s_{X}(z) z^{k}\ dz

for any {R > \rho(X)-\varepsilon}, which when combined with (5) implies that {\rho(X) \leq R} for all such {R}, which is absurd. Thus the spectral radius {\rho(X)} can also be interpreted as the radius of the smallest ball centred at the origin outside of which the Stieltjes transform can be analytically continued.

Now that we have the Stieltjes transform everywhere outside of {[-\rho(X),\rho(X)]}, we can use it to derive an important bound (which will soon be superseded by (3), but will play a key role in the proof of that stronger statement):

Proposition 7 (Boundedness) Let {X} be bounded self-adjoint, and let {P: {\bf C} \rightarrow {\bf C}} be a polynomial. Then

\displaystyle |\tau(P(X))| \leq \sup_{x \in [-\rho(X),\rho(X)]} |P(x)|.

Proof: (Sketch) We can of course assume that {P} is non-constant, as the claim is obvious otherwise. From Exercise 5 (replacing {P} with {P \overline{P}}, where {\overline{P}} is the polynomial whose coefficients are the complex conjugates of those of {P}) we may reduce to the case when {P} has real coefficients, so that {P(X)} is self-adjoint. Since {X} is bounded, it is not difficult (using (5), (6)) to show that {P(X)} is bounded also (Exercise!).

As {P(X)} is bounded self-adjoint, it has a Stieltjes transform defined outside of {[-\rho(P(X)),\rho(P(X))]}, which for large {z} is given by the formula

\displaystyle s_{P(X)}(z) = - \sum_{k=0}^\infty \frac{\tau(P(X)^k)}{z^{k+1}}. \ \ \ \ \ (11)

 

By the previous discussion, to establish the proposition it will suffice to show that the Stieltjes transform can be continued to the domain

\displaystyle \Omega := \{ z \in {\bf C}: |z| > \sup_{x \in [-\rho(X),\rho(X)]} |P(x)| \}.

For this, we observe the partial fractions decomposition

\displaystyle \frac{1}{P(w)-z} = \sum_{\zeta: P(\zeta)=z} \frac{P'(\zeta)^{-1}}{w-\zeta}

of {(P(w)-z)^{-1}} into linear combinations of {(w-\zeta)^{-1}}, at least when the roots of {P-z} are simple. Thus, formally, at least, we have the identity

\displaystyle s_{P(X)}(z) = \sum_{\zeta: P(\zeta)=z} \frac{1}{P'(\zeta)} s_X(\zeta).

One can verify this identity is consistent with (11) for {z} sufficiently large. (Exercise! Hint: First do the case when {X} is a scalar, then expand in Taylor series and compare coefficients, then use the agreement of the Taylor series to do the general case.)

If {z} is in the domain {\Omega}, then all the roots {\zeta} of {P(\zeta)=z} lie outside the interval {[-\rho(X),\rho(X)]}. So we can use the above formula as a definition of {s_{P(X)}(z)}, at least for those {z \in \Omega} for which the roots of {P-z} are simple; but there are only finitely many exceptional {z} (arising from zeroes of {P'}) and one can check (Exercise! Hint: use the analytic nature of {s_X} and the residue theorem to rewrite parts of {s_{P(X)}(z)} as a contour integral.) that the singularities here are removable. It is easy to see (Exercise!) that {s_{P(X)}} is holomorphic outside of these removable singularities, and the claim follows. \Box

Exercise 8 Fill in the steps marked (Exercise!) in the above proof.

From Proposition 7 and the Weierstrass approximation theorem, we see that the linear functional {P \mapsto \tau(P(X))} can be uniquely extended to a bounded linear functional on {C([-\rho(X),\rho(X)])}, of operator norm {1}. Applying the Riesz representation theorem, we can thus find a unique Radon measure (or equivalently, Borel measure) {\mu_X} on {[-\rho(X),\rho(X)]} of total variation {1} obeying the identity (3) for all {P}. In particular, setting {P=1} we see that {\mu_X} has total mass {1}; since it also has total variation {1}, it must be a probability measure. We have thus shown the fundamental

Theorem 9 (Spectral theorem for bounded self-adjoint elements) Let {X} be a bounded self-adjoint element of a non-commutative probability space {({\mathcal A},\tau)}. Then there exists a unique Borel probability measure {\mu_X} on {[-\rho(X),\rho(X)]} (known as the spectral measure of {X}) such that (3) holds for all polynomials {P: {\bf C} \rightarrow {\bf C}}.

Remark 10 If one assumes some completeness properties of the non-commutative probability space, such as that {{\mathcal A}} is a {C^*}-algebra or a von Neumann algebra, one can use this theorem to meaningfully define {F(X)} for other functions {F: [-\rho(X),\rho(X)] \rightarrow {\bf C}} than polynomials; specifically, one can do this for continuous functions {F} if {{\mathcal A}} is a {C^*}-algebra, and for {L^\infty(\mu_X)} functions {F} if {{\mathcal A}} is a von Neumann algebra. Thus for instance we can define absolute values {|X|}, or square roots {|X|^{1/2}}, etc.. Such an assignment {F \mapsto F(X)} is known as a functional calculus; it can be used for instance to go back and make rigorous sense of the formula (7). A functional calculus is a very convenient tool to have in operator algebra theory, and for that reason one often completes a non-commutative probability space into a {C^*}-algebra or von Neumann algebra, much as it is often convenient to complete the rationals and work instead with the reals. However, we will proceed here instead by working with a (possibly incomplete) non-commutative probability space, and working primarily with formal expressions (e.g. formal power series in {z}) without trying to evaluate such expressions in some completed space. We can get away with this because we will be working exclusively in situations in which the spectrum of a random variable can be reconstructed exactly from its moments (which is in particular true in the case of bounded random variables). For unbounded random variables, one must usually instead use the full power of functional analysis, and work with the spectral theory of unbounded operators on Hilbert spaces.

Exercise 11 Let {X} be a bounded self-adjoint element of a non-commutative probability space, and let {\mu_X} be the spectral measure of {X}. Establish the formula

\displaystyle s_X(z) = \int_{[-\rho(X),\rho(X)]} \frac{1}{x-z}\ d\mu_X(x)

for all {z \in {\bf C} \backslash [-\rho(X),\rho(X)]}. Conclude that the support of the spectral measure {\mu_X} must contain at least one of the two points {-\rho(X), \rho(X)}.
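
For a self-adjoint matrix this formula is immediate from diagonalisation; the following Python/numpy sketch (illustrative only) verifies it at an arbitrary complex test point:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
a = rng.standard_normal((n, n))
X = (a + a.T) / 2
lam = np.linalg.eigvalsh(X)

z = 0.3 + 0.5j                                              # any z off the real axis
resolvent = np.trace(np.linalg.inv(X - z * np.eye(n))) / n  # s_X(z)
integral = np.mean(1.0 / (lam - z))                         # integral of 1/(x-z) against mu_X
print(resolvent, integral)                                  # agree, as in the exercise
```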

Exercise 12 Let {X} be a bounded self-adjoint element of a non-commutative probability space with faithful trace. Show that {\rho(X)=0} if and only if {X=0}.

Remark 13 It is possible to also obtain a spectral theorem for bounded normal elements along the lines of the above theorem (with {\mu_X} now supported in a disk rather than in an interval, and with (3) replaced by (4)), but this is somewhat more complicated to show (basically, one needs to extend the self-adjoint spectral theorem to a pair of commuting self-adjoint elements, which is a little tricky to show by complex-analytic methods, as one has to use several complex variables).

The spectral theorem more or less completely describes the behaviour of a single (bounded self-adjoint) element {X} in a non-commutative probability space. As remarked above, it can also be extended to study multiple commuting self-adjoint elements. However, when one deals with multiple non-commuting elements, the spectral theorem becomes inadequate (and indeed, it appears that in general there is no usable substitute for this theorem). However, we can begin making a little bit of headway if we assume as a final (optional) axiom a very weak form of commutativity in the trace:

  • (Trace) For any two elements {X, Y}, we have {\tau(XY)=\tau(YX)}.

Note that this axiom is obeyed by all three of our model examples. From this axiom, we can cyclically permute products in a trace, e.g. {\tau(XYZ)=\tau(YZX)=\tau(ZXY)}. However, we cannot take non-cyclic permutations; for instance, {\tau(XYZ)} and {\tau(XZY)} are distinct in general. This axiom is a trivial consequence of the commutative nature of the complex numbers in the classical setting, but can play a more non-trivial role in the non-commutative setting. It is however possible to develop a large part of free probability without this axiom, if one is willing instead to work in the category of von Neumann algebras. Thus, we shall leave it as an optional axiom:

Definition 14 (Non-commutative probability space, final definition) A non-commutative probability space {({\mathcal A},\tau)} consists of a {*}-algebra {{\mathcal A}} with identity {1}, together with a {*}-linear functional {\tau: {\mathcal A} \rightarrow {\bf C}}, that maps {1} to {1} and obeys the non-negativity axiom. If {\tau} obeys the trace axiom, we say that the non-commutative probability space is tracial. If {\tau} obeys the faithfulness axiom, we say that the non-commutative probability space is faithful.

From this new axiom and the Cauchy-Schwarz inequality we can now get control on products of several non-commuting elements:

Exercise 15 Let {X_1,\ldots,X_k} be bounded self-adjoint elements of a tracial non-commutative probability space {({\mathcal A},\tau)}. Show that

\displaystyle |\tau( X_1^{m_1} \ldots X_k^{m_k} )| \leq \rho(X_1)^{m_1} \ldots \rho(X_k)^{m_k}

for any non-negative integers {m_1,\ldots,m_k}. (Hint: Induct on {k}, and use Cauchy-Schwarz to split up the product as evenly as possible, using cyclic permutations to reduce the complexity of the resulting expressions.)

Exercise 16 Let {{\mathcal A} \cap L^\infty(\tau)} be those elements {X} in a tracial non-commutative probability space {({\mathcal A},\tau)} whose real and imaginary parts {\hbox{Re}(X) := \frac{X+X^*}{2}}, {\hbox{Im}(X) := \frac{X-X^*}{2i}} are bounded and self-adjoint; we refer to such elements simply as bounded elements. Show that this is a sub-*-algebra of {{\mathcal A}}.

This allows one to perform the following Gelfand-Naimark-Segal (GNS) construction. Recall that {{\mathcal A} \cap L^\infty(\tau)} has a positive semi-definite inner product {\langle,\rangle_{L^2(\tau)}}. We can perform the Hilbert space completion of this inner product space (quotienting out by the elements of zero norm), leading to a complex Hilbert space {L^2(\tau)} into which {{\mathcal A} \cap L^\infty(\tau)} can be mapped as a dense subspace by a map {\iota: {\mathcal A} \cap L^\infty(\tau) \rightarrow L^2(\tau)}. (This map is injective when the trace {\tau} is faithful, but will have a non-trivial kernel otherwise.) The space {{\mathcal A} \cap L^\infty(\tau)} acts on itself by multiplication, and thus also acts on the dense subspace {\iota({\mathcal A} \cap L^\infty(\tau))} of {L^2(\tau)}. We would like to extend this action to all of {L^2(\tau)}, but this requires an additional estimate:

Lemma 17 Let {({\mathcal A},\tau)} be a tracial non-commutative probability space. If {X, Y \in {\mathcal A} \cap L^\infty(\tau)} with {X} self-adjoint, then

\displaystyle \| XY \|_{L^2(\tau)} \leq \rho(X) \|Y\|_{L^2(\tau)}.

Proof: Squaring and cyclically permuting, it will suffice to show that

\displaystyle \tau( Y^* X^2 Y ) \leq \rho(X)^2 \tau( Y^* Y ).

Let {\varepsilon > 0} be arbitrary. By Weierstrass approximation, we can find a polynomial {P} with real coefficients such that {x^2 + P(x)^2 = \rho(X)^2 + O(\varepsilon)} on the interval {[-\rho(X),\rho(X)]}. By Proposition 7, we can thus write {X^2 + P(X)^2 = \rho(X)^2 + E} where {E} is self-adjoint with {\rho(E) = O(\varepsilon)}. Multiplying on the left by {Y^*} and on the right by {Y} and taking traces, we obtain

\displaystyle \tau( Y^* X^2 Y ) + \tau( Y^* P(X)^2 Y ) = \rho(X)^2 \tau(Y^* Y) + \tau( Y^* E Y ).

By non-negativity, {\tau(Y^* P(X)^2 Y) \geq 0}. By Exercise 15, we have {\tau( Y^* E Y ) = O_Y(\varepsilon)}. Sending {\varepsilon \rightarrow 0} we obtain the claim. \Box
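
In the matrix model, Lemma 17 is just a reflection of the operator inequality {X^2 \leq \rho(X)^2}; here is a quick Python/numpy sanity check (the helper name l2norm is ad hoc):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
a = rng.standard_normal((n, n))
X = (a + a.T) / 2                          # bounded self-adjoint
Y = rng.standard_normal((n, n))            # an arbitrary bounded element

def l2norm(m):
    # the semi-norm ||M||_{L^2(tau)} = tau(M^* M)^{1/2}
    return np.sqrt(np.trace(m.conj().T @ m).real / n)

rho = np.max(np.abs(np.linalg.eigvalsh(X)))
print(l2norm(X @ Y), rho * l2norm(Y))      # left side is at most the right side
```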

As a consequence, we see that the self-adjoint elements {X} of {{\mathcal A} \cap L^\infty(\tau)} act in a bounded manner on all of {L^2(\tau)}, and so on taking real and imaginary parts, we see that the same is true for the non-self-adjoint elements too. Thus we can associate to each {X \in {\mathcal A} \cap L^\infty(\tau)} a bounded linear transformation {\overline{X} \in B(L^2(\tau))} on the Hilbert space {L^2(\tau)}.

Exercise 18 (Gelfand-Naimark theorem) Show that the map {X \mapsto \overline{X}} is a {*}-isomorphism from {{\mathcal A} \cap L^\infty(\tau)} to a {*}-subalgebra of {B(L^2(\tau))}, and that one has the representation

\displaystyle \tau(X) = \langle e, \overline{X} e \rangle

for any {X \in {\mathcal A} \cap L^\infty(\tau)}, where {e} is the unit vector {e := \iota(1)}.

Remark 19 The Gelfand-Naimark theorem required the tracial hypothesis only to deal with the error {E} in the proof of Lemma 17. One can also establish this theorem without this hypothesis, by assuming instead that the non-commutative space is a {C^*}-algebra; this provides a continuous functional calculus, so that we can replace {P} in the proof of Lemma 17 by a continuous function and dispense with {E} altogether. This formulation of the Gelfand-Naimark theorem is the one which is usually seen in the literature.

The Gelfand-Naimark theorem identifies {{\mathcal A} \cap L^\infty(\tau)} with a {*}-subalgebra of {B(L^2(\tau))}. The closure of this {*}-subalgebra in the weak operator topology is then a von Neumann algebra, which we denote as {L^\infty(\tau)}. As a consequence, we see that non-commutative probability spaces are closely related to von Neumann algebras (equipped with a tracial state {\tau}). However, we refrain from identifying the former completely with the latter, in order to allow ourselves the freedom to work with such spaces as {L^{\infty-}}, which is almost but not quite a von Neumann algebra. Instead, we work with the looser (and more algebraic) definition given in Definition 14.

— 2. Limits of non-commutative random variables —

One benefit of working in an abstract setting is that it becomes easier to take certain types of limits. For instance, it is intuitively obvious that the cyclic groups {{\bf Z}/N{\bf Z}} are “converging” in some sense to the integer group {{\bf Z}}. This convergence can be formalised by selecting a distinguished generator {e} of all groups involved ({1 \hbox{ mod } N} in the case of {{\bf Z}/N{\bf Z}}, and {1} in the case of the integers {{\bf Z}}), and noting that the set of relations involving this generator in {{\bf Z}/N{\bf Z}} (i.e. the relations {ne=0} when {n} is divisible by {N}) converge in a pointwise sense to the set of relations involving this generator in {{\bf Z}} (i.e. the empty set). Here, to see the convergence, we viewed a group abstractly via the relations between its generators, rather than on a concrete realisation of a group as (say) residue classes modulo {N}. (For more discussion of this notion of convergence for finitely generated groups, see this earlier blog post.)

We can similarly define convergence of random variables in non-commutative probability spaces as follows.

Definition 20 (Convergence) Let {({\mathcal A}_n, \tau_n)} be a sequence of non-commutative probability spaces, and let {({\mathcal A}_\infty, \tau_\infty)} be an additional non-commutative probability space. For each {n}, let {X_{n,1},\ldots,X_{n,k}} be a sequence of random variables in {{\mathcal A}_n}, and let {X_{\infty,1},\ldots,X_{\infty,k}} be a sequence of random variables in {{\mathcal A}_\infty}. We say that {X_{n,1},\ldots,X_{n,k}} converges in the sense of moments to {X_{\infty,1},\ldots,X_{\infty,k}} if we have

\displaystyle \tau_n( X_{n,i_1} \ldots X_{n,i_m} ) \rightarrow \tau_\infty( X_{\infty,i_1} \ldots X_{\infty,i_m} )

as {n \rightarrow \infty} for any sequence {i_1,\ldots,i_m \in \{1,\ldots,k\}}. We say that {X_{n,1},\ldots,X_{n,k}} converge in the sense of {*}-moments to {X_{\infty,1},\ldots,X_{\infty,k}} if {X_{n,1},\ldots,X_{n,k},X^*_{n,1},\ldots,X^*_{n,k}} converges in the sense of moments to {X_{\infty,1},\ldots,X_{\infty,k},X^*_{\infty,1},\ldots,X^*_{\infty,k}}.

If {X_1,\ldots,X_k} (viewed as a constant {k}-tuple in {n}) converges in the sense of moments (resp. {*}-moments) to {Y_1,\ldots,Y_k}, we say that {X_1,\ldots,X_k} and {Y_1,\ldots,Y_k} have matching joint moments (resp. matching joint {*}-moments).

Example 21 If {X_n,Y_n} converge in the sense of moments to {X_\infty, Y_\infty} then we have for instance that

\displaystyle \tau_n(X_n Y_n^k X_n) \rightarrow \tau_\infty(X_\infty Y_\infty^k X_\infty)

as {n \rightarrow \infty} for each {k}, while if they converge in the stronger sense of {*}-moments then we obtain more limits, such as

\displaystyle \tau_n(X_n Y_n^k X_n^*) \rightarrow \tau_\infty(X_\infty Y_\infty^k X_\infty^*).

Note however that no uniformity in {k} is assumed for this convergence; in particular, if {k} varies in {n} (e.g. if { k= O(\log n)}), there is now no guarantee that one still has convergence.

Remark 22 When the underlying objects {X_{n,1},\ldots,X_{n,k}} and {X_{\infty,1},\ldots,X_{\infty,k}} are self-adjoint, then there is no distinction between convergence in moments and convergence in {*}-moments. However, for non-self-adjoint variables, the latter type of convergence is far stronger, and the former type is usually too weak to be of much use, even in the commutative setting. For instance, let {X} be a classical random variable drawn uniformly at random from the unit circle {\{ z \in {\bf C}: |z|=1\}}. Then the constant sequence {X_n=X} has all the same moments as the zero random variable {0}, and thus converges in the sense of moments to zero, but does not converge in the {*}-moment sense to zero.
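
A quick Monte Carlo illustration of this remark (a Python/numpy sketch, with an arbitrary sample size):

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.exp(1j * rng.uniform(0, 2 * np.pi, 10 ** 6))   # uniform on the unit circle

print(np.mean(X ** 2))          # E X^k ~ 0 for all k >= 1: moments match those of 0
print(np.mean(X * X.conj()))    # E X X^* = 1: the *-moments certainly do not
```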

It is also clear that if we require that {{\mathcal A}_\infty} be generated by {X_{\infty,1},\ldots,X_{\infty,k}} in the {*}-algebraic sense (i.e. every element of {{\mathcal A}_\infty} is a polynomial combination of {X_{\infty,1},\ldots,X_{\infty,k}} and their adjoints) then a limit in the sense of {*}-moments, if it exists, is unique up to matching joint {*}-moments.

For a single sequence {X_n} of uniformly bounded self-adjoint elements, convergence in moments is equivalent to convergence in distribution:

Exercise 23 Let {X_n \in {\mathcal A}_n} be a sequence of self-adjoint elements in non-commutative probability spaces {({\mathcal A}_n, \tau_n)} with {\rho(X_n)} uniformly bounded, and let {X_\infty \in {\mathcal A}_\infty} be another bounded self-adjoint element in a non-commutative probability space {({\mathcal A}_\infty, \tau_\infty)}. Show that {X_n} converges in moments to {X_\infty} if and only if the spectral measure {\mu_{X_n}} converges in the vague topology to {\mu_{X_\infty}}.

Thus, for instance, one can rephrase the Wigner semi-circular law (in the convergence in expectation formulation) as the assertion that if {M_n \in L^{\infty-} \otimes M_n({\bf C})} is a sequence of Wigner random matrices with (say) subgaussian entries of mean zero and variance one, then the normalised matrices {M_n/\sqrt{n}}, when viewed as elements of the non-commutative probability space {(L^{\infty-} \otimes M_n({\bf C}), \mathop{\bf E} \frac{1}{n} \hbox{tr})}, will converge in moments to any bounded self-adjoint element {u} of a non-commutative probability space with spectral measure given by the semi-circular distribution {\mu_{sc} := \frac{1}{2\pi} (4-x^2)_+^{1/2}\ dx}. Such elements are known as semi-circular elements. Here are some easy examples of semi-circular elements:

  • A classical real random variable {u} drawn using the probability measure {\mu_{sc}}.
  • The identity function {x \mapsto x} in the Lebesgue space {L^\infty(d\mu_{sc})}, endowed with the trace {\tau(f) := \int_{\bf R} f\ d\mu_{sc}}.
  • The function {\theta \mapsto 2\cos\theta} in the Lebesgue space {L^\infty([0,\pi], \frac{2}{\pi} \sin^2 \theta\ d\theta)}.

Here is a more interesting example of a semi-circular element:

Exercise 24 Let {({\mathcal A},\tau)} be the non-commutative probability space consisting of bounded operators {B(\ell^2({\bf N}))} on the natural numbers with trace {\tau(X) := \langle e_0, Xe_0 \rangle_{\ell^2({\bf N})}}, where {e_0,e_1,\ldots} is the standard basis of {\ell^2({\bf N})}. Let {U: e_n \mapsto e_{n+1}} be the right shift on {\ell^2({\bf N})}. Show that {U+U^*} is a semicircular operator. (Hint: one way to proceed here is to use Fourier analysis to identify {\ell^2({\bf N})} with the space of odd functions {\theta \mapsto f(\theta)} on {{\bf R}/2\pi{\bf Z}}, with {U} being the operator that maps {\sin(n\theta)} to {\sin((n+1)\theta)}; show that {U+U^*} is then the operation of multiplication by {2\cos \theta}.) One can also interpret {U} as a creation operator on a Fock space; alternatively, one can compute the moments {\tau((U+U^*)^k)} directly, obtaining the Catalan numbers when {k} is even and zero when {k} is odd. Note that this provides a (very) slightly different proof of the semi-circular law from that given by the moment method in Notes 4.
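
One can also test the conclusion of this exercise numerically by truncating the shift to a large finite matrix; the moments {\langle e_0, (U+U^*)^k e_0\rangle} of order {k} less than the truncation size are unaffected by the truncation, and match the Catalan numbers (the moments of {\mu_{sc}}). A Python sketch (illustrative only):

```python
import numpy as np
from math import comb

N = 40                                 # truncation size; moments of order k < N are exact
U = np.zeros((N, N))
for m in range(N - 1):
    U[m + 1, m] = 1.0                  # the shift e_m -> e_{m+1}
T = U + U.T                            # truncated version of U + U^*

e0 = np.zeros(N); e0[0] = 1.0
for k in range(9):
    moment = e0 @ np.linalg.matrix_power(T, k) @ e0      # tau((U+U^*)^k)
    catalan = comb(k, k // 2) // (k // 2 + 1) if k % 2 == 0 else 0
    print(k, moment, catalan)          # Catalan numbers for even k, zero for odd k
```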

Because we are working in such an abstract setting with so few axioms, limits exist in abundance:

Exercise 25 For each {n}, let {X_{n,1},\ldots,X_{n,k}} be bounded self-adjoint elements of a tracial non-commutative probability space {({\mathcal A}_n, \tau_n)}. Suppose that the spectral radii {\rho(X_{n,1}),\ldots,\rho(X_{n,k})} are uniformly bounded in {n}. Show that there exists a subsequence {n_j} and bounded self-adjoint elements {X_1,\ldots,X_k} of a tracial non-commutative probability space {({\mathcal A}, \tau)} such that {X_{n_j,1}, \ldots, X_{n_j,k}} converge in moments to {X_1,\ldots,X_k} as {j \rightarrow \infty}. (Hint: use the Bolzano-Weierstrass theorem and the Arzelà-Ascoli diagonalisation trick to obtain a subsequence in which each of the joint moments of {X_{n_j,1},\ldots,X_{n_j,k}} converge as {j \rightarrow \infty}. Use these moments to build a noncommutative probability space.)

— 3. Free independence —

We now come to the fundamental concept in free probability theory, namely that of free independence.

Definition 26 (Free independence) A collection {X_1,\ldots,X_k} of random variables in a non-commutative probability space {({\mathcal A},\tau)} is freely independent (or free for short) if one has

\displaystyle \tau( (P_1( X_{i_1} ) - \tau(P_1(X_{i_1}))) \ldots (P_m(X_{i_m}) - \tau(P_m(X_{i_m}))) ) = 0

whenever {P_1,\ldots,P_m} are polynomials and {i_1,\ldots,i_m \in \{1,\ldots,k\}} are indices with no two adjacent {i_j} equal.

A sequence {X_{n,1},\ldots,X_{n,k}} of random variables in a non-commutative probability space {({\mathcal A}_n,\tau_n)} is asymptotically freely independent (or asymptotically free for short) if one has

\displaystyle \tau_n( (P_1( X_{n,i_1} ) - \tau(P_1(X_{n,i_1}))) \ldots (P_m(X_{n,i_m}) - \tau(P_m(X_{n,i_m}))) )

\displaystyle \rightarrow 0

as {n \rightarrow \infty} whenever {P_1,\ldots,P_m} are polynomials and {i_1,\ldots,i_m \in \{1,\ldots,k\}} are indices with no two adjacent {i_j} equal.

Remark 27 The above definition describes freeness of collections of random variables in {{\mathcal A}}. One can more generally define freeness of collections of subalgebras of {{\mathcal A}}, which in some sense is the more natural concept from a category-theoretic perspective, but we will not need this concept here. (See e.g. this survey of Biane for more discussion.)

Thus, for instance, if {X, Y} are freely independent, then {\tau( P(X) Q(Y) R(X) S(Y) )} will vanish for any polynomials {P,Q,R,S} for which {\tau(P(X)), \tau(Q(Y)), \tau(R(X)), \tau(S(Y))} all vanish. This is in contrast to classical independence of classical (commutative) random variables, which would only assert that {\tau(P(X) Q(Y))=0} whenever {\tau(P(X)), \tau(Q(Y))} both vanish.

To contrast free independence with classical independence, suppose that {\tau(X)=\tau(Y)=0}. If {X, Y} were freely independent, then {\tau(XYXY)=0}. If instead {X, Y} were commuting and classically independent, then we would instead have {\tau(XYXY) = \tau(X^2Y^2) = \tau(X^2)\tau(Y^2)}, which would almost certainly be non-zero.

For a trivial example of free independence, {X} and {Y} automatically are freely independent if at least one of {X, Y} is constant (i.e. a multiple of the identity {1}). In the commutative setting, this is basically the only way one can have free independence:

Exercise 28 Suppose that {X, Y} are freely independent self-adjoint elements of a faithful non-commutative probability space which also commute. Show that at least one of {X, Y} is equal to a scalar. (Hint: First normalise {X, Y} to have trace zero, and consider {\tau(XYXY)}.)

A less trivial example of free independence comes from the free group, which provides a clue as to the original motivation of this concept:

Exercise 29 Let {{\Bbb F}_2} be the free group on two generators {g_1,g_2}. Let {{\mathcal A} = B(\ell^2({\Bbb F}_2))} be the non-commutative probability space of bounded linear operators on the Hilbert space {\ell^2({\Bbb F}_2)}, with trace {\tau(X) := \langle X e_0, e_0 \rangle}, where {e_0} is the Kronecker delta function at the identity. Let {U_1, U_2 \in {\mathcal A}} be the shift operators

\displaystyle U_1 f(g) := f(g_1 g); \quad U_2 f(g) := f(g_2 g)

for {f \in \ell^2({\Bbb F}_2)} and {g \in {\Bbb F}_2}. Show that {U_1, U_2} are freely independent.

For classically independent commuting random variables {X, Y}, knowledge of the individual moments {\tau(X^k)}, {\tau(Y^k)} gave complete information on the joint moments: {\tau(X^k Y^l) = \tau(X^k) \tau(Y^l)}. The same fact is true for freely independent random variables, though the situation is more complicated. We begin with a simple case: computing {\tau(XY)} in terms of the moments of {X, Y}. From free independence we have

\displaystyle \tau( (X-\tau(X)) (Y-\tau(Y)) ) = 0.

Expanding this using the linear nature of the trace, one soon sees that

\displaystyle \tau(X Y) = \tau(X)\tau(Y). \ \ \ \ \ (12)

 

So far, this is just as with the classically independent case. Next, we consider a slightly more complicated moment, {\tau(XYX)}. If we split {Y = \tau(Y) + (Y-\tau(Y))}, we can write this as

\displaystyle \tau(XYX) = \tau(Y) \tau(X^2) + \tau(X (Y-\tau(Y)) X).

In the classically independent case, we could conclude that the latter term vanishes. We cannot immediately say this in the freely independent case, because only one of the factors has mean zero. But from (12) we know that {\tau( X (Y-\tau(Y)) ) = \tau( (Y-\tau(Y)) X ) = 0}. Because of this, we can expand

\displaystyle \tau(X (Y-\tau(Y)) X) = \tau( (X-\tau(X)) (Y-\tau(Y)) (X-\tau(X)) )

and now free independence does ensure that this term vanishes, and so

\displaystyle \tau(XYX) = \tau(Y) \tau(X^2). \ \ \ \ \ (13)

 

So again we have not yet deviated from the classically independent case. But now let us look at {\tau(XYXY)}. We split the second {X} into {\tau(X)} and {X-\tau(X)}. Using (12) to control the former term, we have

\displaystyle \tau(XYXY) = \tau(X)^2 \tau(Y^2) + \tau(XY(X-\tau(X))Y).

From (13) (with the roles of {X} and {Y} interchanged) we have {\tau(Y(X-\tau(X))Y)=0}, so we may also replace the first factor {X} by {X-\tau(X)}, giving

\displaystyle \tau(XYXY) = \tau(X)^2 \tau(Y^2) + \tau((X-\tau(X))Y(X-\tau(X))Y).

Now we split each factor {Y} into {\tau(Y)} and {Y-\tau(Y)}. Free independence (together with (12)) eliminates all terms involving a factor of {Y-\tau(Y)}, and we are left with

\displaystyle \tau(XYXY) = \tau(X)^2 \tau(Y^2) + \tau((X-\tau(X))\tau(Y)(X-\tau(X))\tau(Y))

which simplifies to

\displaystyle \tau(XYXY) = \tau(X)^2 \tau(Y^2) + \tau(X^2) \tau(Y)^2 - \tau(X)^2 \tau(Y)^2

which differs from the classical independence prediction of {\tau(X^2) \tau(Y^2)}.
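Anticipating the asymptotic freeness phenomenon discussed below (Proposition 33), one can test this prediction numerically by modelling {X, Y} as shifted independent Wigner matrices, which behave like free variables for large {n}. A rough numpy sketch, with the shifts {a, b} chosen arbitrarily so that the traces are non-zero:

```python
import numpy as np

# Numerical illustration of the free moment formula for tau(XYXY), modelling
# X, Y by shifted independent Wigner matrices of size n, with tau = (1/n) tr.
# Agreement with the free prediction is up to O(1/sqrt(n))-type fluctuations.
rng = np.random.default_rng(0)
n = 2000

def wigner(n):
    """A GUE-type Wigner matrix, normalised to have spectrum roughly [-2, 2]."""
    G = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (G + G.conj().T) / (2 * np.sqrt(n))

tau = lambda M: np.trace(M).real / n

a, b = 0.8, -0.5                 # arbitrary shifts so that tau(X), tau(Y) != 0
X = a * np.eye(n) + wigner(n)
Y = b * np.eye(n) + wigner(n)

lhs = tau(X @ Y @ X @ Y)
free_prediction = (tau(X)**2 * tau(Y @ Y) + tau(X @ X) * tau(Y)**2
                   - tau(X)**2 * tau(Y)**2)
classical_prediction = tau(X @ X) * tau(Y @ Y)
print(lhs, free_prediction, classical_prediction)
# lhs matches the free prediction (about 1.05 here), not the classical
# prediction (about 2.05)
```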

This process can be continued:

Exercise 30 Let {X_1,\ldots,X_k} be freely independent. Show that any joint moment of {X_1,\ldots,X_k} can be expressed as a polynomial combination of the individual moments {\tau(X_i^j)} of the {X_i}. (Hint: induct on the complexity of the moment.)
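This induction is not a substitute for the proof, but for two free variables the centring procedure used above can be turned into a small symbolic computation: repeatedly centre the first letter of non-zero trace, using linearity of {\tau} and merging adjacent letters from the same algebra, until the definition of freeness makes every surviving word vanish. A sympy sketch of this recursion (the representation and helper names are ours):

```python
import sympy as sp

# Sketch of the recursion behind Exercise 30, for two free variables X, Y.
# A "letter" is a pair (var, coeffs) encoding the polynomial
# sum_k coeffs[k] * var^k; a word is a tuple of letters.  The individual
# moments tau(X^k), tau(Y^k) are the formal symbols x1, x2, ... , y1, y2, ...
MAXK = 9
mom = {"X": [sp.Integer(1)] + [sp.Symbol(f"x{k}") for k in range(1, MAXK)],
       "Y": [sp.Integer(1)] + [sp.Symbol(f"y{k}") for k in range(1, MAXK)]}

def tr(var, p):
    """Trace of a single letter, by linearity."""
    return sp.expand(sum(c * mom[var][k] for k, c in enumerate(p)))

def mul(p, q):
    """Product of two one-variable polynomials given as coefficient tuples."""
    out = [sp.Integer(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return tuple(out)

def normalise(word):
    """Merge adjacent letters in the same variable; pull out scalar letters."""
    scalar, out = sp.Integer(1), []
    for var, p in word:
        if out and out[-1][0] == var:
            p = mul(out[-1][1], p)
            out.pop()
        if all(sp.expand(c) == 0 for c in p[1:]):   # constant letter: a scalar
            scalar *= p[0]
        else:
            out.append((var, tuple(p)))
    return scalar, tuple(out)

def tau(word):
    """tau of a word in X and Y, via centring and free independence."""
    scalar, w = normalise(word)
    if not w:
        return scalar
    for idx, (var, p) in enumerate(w):
        t = tr(var, p)
        if t != 0:
            centred = (p[0] - t,) + tuple(p[1:])    # p = t*1 + (p - t*1)
            return sp.expand(scalar * (
                t * tau(w[:idx] + w[idx + 1:])
                + tau(w[:idx] + ((var, centred),) + w[idx + 1:])))
    return sp.Integer(0)   # all letters centred and alternating: freeness

X, Y = ("X", (0, 1)), ("Y", (0, 1))
print(tau((X, Y, X)))      # x2*y1, i.e. tau(XYX) = tau(Y) tau(X^2), cf. (13)
print(tau((X, Y, X, Y)))   # x1^2*y2 + x2*y1^2 - x1^2*y1^2 (up to term order)
```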

The product measure construction allows us to generate classically independent random variables at will (after extending the underlying sample space): see Exercise 18 of Notes 0. There is an analogous construction, called the amalgamated free product, that allows one to generate families of freely independent random variables, each of which has a specified distribution. Let us give an illustrative special case of this construction:

Lemma 31 (Free products) For each {1 \leq i \leq k}, let {({\mathcal A}_i,\tau_i)} be a non-commutative probability space. Then there exists a non-commutative probability space {({\mathcal A},\tau)} which contains embedded copies of each of the {({\mathcal A}_i,\tau_i)}, such that whenever {X_i \in {\mathcal A}_i} for {i=1,\ldots,k}, the variables {X_1,\ldots,X_k} are freely independent.

Proof: (Sketch) Recall that each {{\mathcal A}_i} can be given an inner product {\langle,\rangle_{L^2({\mathcal A}_i)}}. One can then orthogonally decompose each space {{\mathcal A}_i} into the constants {{\bf C}}, plus the trace zero elements {{\mathcal A}_i^0 := \{ X \in {\mathcal A}_i: \tau_i(X)=0\}}.

We now form the Fock space {{\mathcal F}} to be the inner product space formed by the direct sum of tensor products

\displaystyle {\mathcal A}_{i_1}^0 \otimes \ldots \otimes {\mathcal A}_{i_m}^0 \ \ \ \ \ (14)

 

where {m \geq 0}, and {i_1,\ldots,i_m \in \{1,\ldots,k\}} are indices with no two adjacent {i_j, i_{j+1}} equal. Each element {X_i \in {\mathcal A}_i} then acts on this Fock space by defining

\displaystyle X_i ( Y_{i_1} \otimes \ldots \otimes Y_{i_m} ) := X_i \otimes Y_{i_1} \otimes \ldots \otimes Y_{i_m}

when {i \neq i_1}, and

\displaystyle X_i ( Y_{i_1} \otimes \ldots \otimes Y_{i_m} ) := \tau( X_i Y_{i_1} ) Y_{i_2} \otimes \ldots \otimes Y_{i_m}

\displaystyle + (X_i Y_{i_1} - \tau(X_i Y_{i_1})) \otimes Y_{i_2} \otimes \ldots \otimes Y_{i_m}

when {i=i_1}. One can thus map {{\mathcal A}_i} into the space {{\mathcal A} := \hbox{Hom}({\mathcal F}, {\mathcal F})} of linear maps from {{\mathcal F}} to itself. The latter can be given the structure of a non-commutative probability space by defining the trace {\tau(X)} of an element {X \in {\mathcal A}} by the formula {\tau(X) := \langle X e_\emptyset, e_\emptyset \rangle_{{\mathcal F}}}, where {e_\emptyset} is the vacuum state of {{\mathcal F}}, i.e. the unit of the {m=0} tensor product. One can verify (Exercise!) that each {{\mathcal A}_i} embeds into {{\mathcal A}} and that elements from different {{\mathcal A}_i} are freely independent. \Box

Exercise 32 Complete the proof of Lemma 31. (Hint: you may find it helpful to first do Exercise 29, as the construction here is an abstraction of the one in that exercise.)

Finally, we illustrate the fundamental connection between free probability and random matrices observed by Voiculescu, namely that (classically) independent families of random matrices are asymptotically free. The intuition here is that while a large random matrix {M} will certainly correlate with itself (so that, for instance, {\hbox{tr} M^* M} will be large), once one interposes an independent random matrix {N} of trace zero, the correlation is largely destroyed (thus, for instance, {\hbox{tr} M^* N M} will usually be quite small).

We give a typical instance of this phenomenon here:

Proposition 33 (Asymptotic freeness of Wigner matrices) Let {M_{n,1},\ldots,M_{n,k}} be a collection of independent {n \times n} Wigner matrices, where the coefficients all have uniformly bounded {m^{th}} moments for each {m}. Then the random variables {\frac{1}{\sqrt{n}} M_{n,1},\ldots,\frac{1}{\sqrt{n}} M_{n,k} \in (L^{\infty-} \otimes M_n({\bf C}), {\bf E}\frac{1}{n}\hbox{tr})} are asymptotically free.

Proof: (Sketch) Let us abbreviate {\frac{1}{\sqrt{n}} M_{n,j}} as {X_j} (suppressing the {n} dependence). It suffices to show that the traces

\displaystyle \tau( \prod_{j=1}^m (X_{i_j}^{a_j} - \tau(X_{i_j}^{a_j})) ) = o(1)

for each fixed choice of natural numbers {a_1,\ldots,a_m}, where no two adjacent {i_j, i_{j+1}} are equal.

Recall from Notes 3 that {\tau(X_j^{a_j})} is (up to errors of {o(1)}) equal to a normalised count of paths of length {a_j} in which each edge is traversed exactly twice, with the edges forming a tree. After normalisation, this count is equal to {0} when {a_j} is odd, and equal to the Catalan number {C_{a_j/2}} when {a_j} is even.

One can perform a similar computation to compute {\tau( \prod_{j=1}^m X_{i_j}^{a_j} )}. Up to errors of {o(1)}, this is a normalised count of coloured paths of length {a_1+\ldots+a_m}, where the first {a_1} edges are coloured with colour {i_1}, the next {a_2} with colour {i_2}, etc. Furthermore, each edge is traversed exactly twice (with the two traversals of each edge being assigned the same colour), and the edges form a tree. As a consequence, there must exist a {j} for which the block of {a_j} edges of colour {i_j} forms its own sub-tree, which contributes a factor of {C_{a_j/2}} or {0} to the final trace. Because of this, when one instead computes the normalised expression {\tau( \prod_{j=1}^m (X_{i_j}^{a_j} - \tau(X_{i_j}^{a_j})) )}, all contributions that are not {o(1)} cancel themselves out, and the claim follows. \Box

Exercise 34 Expand the above sketch into a full proof of Proposition 33.

Remark 35 This is by no means the only way in which random matrices can become asymptotically free. For instance, if instead one considers random matrices of the form {M_{n,i} = U_i^* A_i U_i}, where {A_i} are deterministic Hermitian matrices with uniformly bounded eigenvalues, and the {U_i} are iid unitary matrices drawn using Haar measure on the unitary group {U(n)}, one can also show that the {M_{n,i}} are asymptotically free; again, see the paper of Voiculescu for details.

— 4. Free convolution —

When one is summing two classically independent (real-valued) random variables {X} and {Y}, the distribution {\mu_{X+Y}} of the sum {X+Y} is the convolution {\mu_X * \mu_Y} of the distributions {\mu_X} and {\mu_Y}. This convolution can be computed by means of the characteristic function

\displaystyle F_X(t) := \tau(e^{itX}) = \int_{\bf R} e^{itx}\ d\mu_X(x)

by means of the simple formula

\displaystyle \tau(e^{it(X+Y)}) = \tau(e^{itX}) \tau(e^{itY}).

As we saw in Notes 2, this can be used in particular to establish a short proof of the central limit theorem.

There is an analogous theory when summing two freely independent (self-adjoint) non-commutative random variables {X} and {Y}; the distribution {\mu_{X+Y}} turns out to be a certain combination {\mu_X \boxplus \mu_Y}, known as the free convolution of {\mu_X} and {\mu_Y}. To compute this free convolution, one does not use the characteristic function; instead, the correct tool is the Stieltjes transform

\displaystyle s_X(z) := \tau( (X-z)^{-1} ) = \int_{\bf R} \frac{1}{x-z}\ d\mu_X(x)

which has already been discussed earlier.

Here’s how to use this transform to compute free convolutions. If one wishes, one can assume that {X} is bounded so that all series involved converge for {z} large enough, though the entire argument below can also be performed at a purely algebraic level, using formal power series, so this boundedness hypothesis is not actually necessary.

The trick (which we already saw in Notes 4) is not to view {s = s_X(z)} as a function of {z}, but rather to view {z = z_X(s)} as a function of {s}. Given that one asymptotically has {s \sim -1/z} for {z} large, we expect to be able to perform this inversion for {z} large and {s} close to zero; and in any event one can easily invert (8) on the level of formal power series.

With this inversion, we thus have

\displaystyle s = \tau((X-z_X(s))^{-1}) \ \ \ \ \ (15)

 

and thus

\displaystyle (X-z_X(s))^{-1} = s(1 - E_X)

for some {E_X = E_X(s)} of trace zero. Now we do some (formal) algebraic sleight of hand. We rearrange the above identity as

\displaystyle X = z_X(s) + s^{-1} (1 - E_X)^{-1}.

Similarly we have

\displaystyle Y = z_Y(s) + s^{-1} (1 - E_Y)^{-1}

and so

\displaystyle X + Y = z_X(s) + z_Y(s) + s^{-1} [ (1-E_X)^{-1} + (1-E_Y)^{-1} ].

We can combine the two terms inside the brackets via the identity

\displaystyle (1-E_X)^{-1} + (1-E_Y)^{-1} = (1-E_X)^{-1} (1- E_Y + 1- E_X) (1-E_Y)^{-1}.

Meanwhile

\displaystyle 1 = (1-E_X)^{-1} (1 - E_Y - E_X + E_X E_Y) (1-E_Y)^{-1}

and so

\displaystyle X + Y = z_X(s) + z_Y(s) + s^{-1} + s^{-1} [(1-E_X)^{-1} (1 - E_X E_Y) (1-E_Y)^{-1}].

We can rearrange this a little bit as

\displaystyle (X+Y - z_X(s)-z_Y(s)-s^{-1})^{-1} = s [(1-E_Y) (1 - E_X E_Y)^{-1} (1-E_X) ].

We expand out {(1 - E_X E_Y)^{-1}} as a (formal) Neumann series:

\displaystyle (1-E_Y) (1 - E_X E_Y)^{-1} (1-E_X) = (1-E_Y) (1+E_X E_Y + E_X E_Y E_X E_Y + \ldots ) (1-E_X).

This expands out to equal {1} plus a whole string of alternating products of {E_X} and {E_Y}.

Now we use the hypothesis that {X} and {Y} are free. This easily implies that {E_X} and {E_Y} are also free. Since they additionally have trace zero, the definition of free independence shows that all alternating products of {E_X} and {E_Y} have zero trace. (In the case when there are an odd number of terms in the product, one can obtain this zero trace property using the cyclic property of trace and induction.) We conclude that

\displaystyle \tau( (1-E_Y) (1 - E_X E_Y)^{-1} (1-E_X) ) = 1

and so

\displaystyle \tau((X+Y - z_X(s)-z_Y(s)-s^{-1})^{-1}) = s.

Comparing this against (15) for {X+Y} we conclude that

\displaystyle z_{X+Y}(s) = z_X(s) + z_Y(s) + s^{-1}.

Thus, if we define the {R}-transform {R_X} of {X} to be (formally) given by the formula

\displaystyle R_X(s) := z_X(-s) - s^{-1}

then we have the addition formula

\displaystyle R_{X+Y} = R_X + R_Y.

Since one can recover the Stieltjes transform {s_X} (and hence the {R}-transform {R_X}) from the spectral measure {\mu_X} and vice versa, this formula (in principle, at least) lets one compute the spectral measure {\mu_{X+Y}} of {X+Y} from the spectral measures {\mu_X, \mu_Y}, thus allowing one to define free convolution.
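For a concrete example: the free convolution of the Bernoulli law {\frac{1}{2}(\delta_{-1}+\delta_1)} with itself is known to be the arcsine law on {[-2,2]}, whose {2k^{th}} moment is the central binomial coefficient {\binom{2k}{k}}. Anticipating Remark 35, one can model two free copies of a {\pm 1} sign variable by a deterministic sign matrix and a Haar-unitary conjugate of it. A rough numpy sketch, assuming these standard facts:

```python
import numpy as np

# Free convolution demo: the spectral distribution of A + U A U^* for a
# deterministic sign matrix A and a Haar unitary U should approach the free
# convolution of two Bernoulli laws, i.e. the arcsine law on [-2,2], whose
# 2k-th moments are the central binomial coefficients (m2 = 2, m4 = 6).
rng = np.random.default_rng(1)
n = 2000

def haar_unitary(n):
    """Haar unitary via QR of a complex Ginibre matrix, with phase fix."""
    Z = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))

A = np.diag(np.concatenate([np.ones(n // 2), -np.ones(n // 2)]))
U = haar_unitary(n)
S = A + U @ A @ U.conj().T     # sum of two asymptotically free sign variables

tau = lambda M: np.trace(M).real / n
print(tau(S @ S), tau(np.linalg.matrix_power(S, 4)))   # ~ 2 and ~ 6
```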

For comparison, we have the (formal) addition formula

\displaystyle \log F_{X+Y} = \log F_X + \log F_Y

for classically independent real random variables {X, Y}. The following exercises carry this analogy a bit further.

Exercise 36 Let {X} be a classical real random variable. Working formally, show that

\displaystyle \log F_X(t) = \sum_{k=1}^\infty \frac{\kappa_{k}(X)}{k!} (it)^k

where the cumulants {\kappa_k(X)} can be reconstructed from the moments {\tau(X^k)} by the recursive formula

\displaystyle \tau(X^k) = \kappa_k(X) + \sum_{j=1}^{k-1} \kappa_j(X) \sum_{a_1+\ldots+a_j = k-j} \tau(X^{a_1+\ldots+a_j})

for {k \geq 1}. (Hint: start with the identity {\frac{d}{dt} F_X(t) = (\frac{d}{dt} \log F_X(t)) F_X(t)}.) Thus for instance {\kappa_1(X) = \tau(X)} is the expectation, {\kappa_2(X) = \tau(X^2) - \tau(X)^2} is the variance, and the third and fourth cumulants are given by the formulae

\displaystyle \kappa_3(X) = \tau(X^3) - 3 \tau(X^2) \tau(X) + 2 \tau(X)^3.

\displaystyle \kappa_4(X) = \tau(X^4) - 4 \tau(X^3) \tau(X) - 3 \tau(X^2)^2 + 12 \tau(X^2) \tau(X)^2 - 6 \tau(X)^4.

Establish the additional formula

\displaystyle \tau(X^k) = \sum_{\pi} \prod_{A \in \pi} \kappa_{|A|}(X)

where {\pi} ranges over all partitions of {\{1,\ldots,k\}} into non-empty cells {A}.
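This recursion is easy to mechanise: for fixed {j}, the inner sum runs over the {\binom{k-1}{j-1}} compositions of {k-j} into {j} non-negative parts {a_1,\ldots,a_j}, each contributing the same quantity {\kappa_j(X) \tau(X^{k-j})}, so one can solve for {\kappa_k(X)} directly. A short sympy sketch (treating the moments as formal symbols) that reproduces the formulae above:

```python
import sympy as sp

# Classical moment-cumulant recursion of Exercise 36: the inner sum over
# compositions a_1+...+a_j = k-j contributes binom(k-1, j-1) copies of
# kappa_j * m_{k-j}, so kappa_k can be solved for recursively.
K = 4
m = [sp.Integer(1)] + [sp.Symbol(f"m{k}") for k in range(1, K + 1)]  # m[0] = 1

kappa = {}
for k in range(1, K + 1):
    kappa[k] = sp.expand(m[k] - sum(sp.binomial(k - 1, j - 1) * kappa[j] * m[k - j]
                                    for j in range(1, k)))
    print(f"kappa_{k} =", kappa[k])
# kappa_4 = m4 - 4*m1*m3 - 3*m2**2 + 12*m1**2*m2 - 6*m1**4, as claimed
```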

Exercise 37 Let {X} be a non-commutative random variable. Working formally, show that

\displaystyle R_X(s) = \sum_{k=1}^\infty C_k(X) s^{k-1}

where the free cumulants {C_k(X)} can be reconstructed from the moments {\tau(X^k)} by the recursive formula

\displaystyle \tau(X^k) = C_k(X) + \sum_{j=1}^{k-1} C_j(X) \sum_{a_1+\ldots+a_j = k-j} \tau(X^{a_1}) \ldots \tau(X^{a_j})

for {k \geq 1}. (Hint: start with the identity {s_X(z) R_X(-s_X(z)) = 1 + z s_X(z)}.) Thus for instance {C_1(X) = \tau(X)} is the expectation, {C_2(X) = \tau(X^2) - \tau(X)^2} is the variance, and the third and fourth free cumulants are given by the formulae

\displaystyle C_3(X) = \tau(X^3) - 3 \tau(X^2) \tau(X) + 2 \tau(X)^3

\displaystyle C_4(X) = \tau(X^4) - 4 \tau(X^3) \tau(X) - 2 \tau(X^2)^2 + 10 \tau(X^2) \tau(X)^2 - 5 \tau(X)^4.

Establish the additional formula

\displaystyle \tau(X^k) = \sum_{\pi} \prod_{A \in \pi} C_{|A|}(X)

where {\pi} ranges over all partitions of {\{1,\ldots,k\}} into non-empty cells {A} which are non-crossing, which means that if {a < b < c < d} lie in {\{1,\ldots,k\}}, then it cannot be the case that {a,c} lie in one cell {A} while {b,d} lie in a distinct cell {A'}.
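The free recursion can be mechanised in the same way; the only difference is that the inner sum now genuinely depends on the individual composition through the product {\tau(X^{a_1}) \ldots \tau(X^{a_j})}, so the compositions are enumerated explicitly. A sympy sketch in the same spirit, which reproduces the free cumulant formulae and makes the divergence from the classical cumulants visible at {k=4}:

```python
import sympy as sp

def compositions(total, parts):
    """All ordered tuples of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

# Free moment-cumulant recursion of Exercise 37, with tau(X^0) = 1.
K = 4
m = [sp.Integer(1)] + [sp.Symbol(f"m{k}") for k in range(1, K + 1)]

C = {}
for k in range(1, K + 1):
    lower = sum(C[j] * sp.Mul(*(m[a] for a in comp))
                for j in range(1, k)
                for comp in compositions(k - j, j))
    C[k] = sp.expand(m[k] - lower)
    print(f"C_{k} =", C[k])
# C_4 = m4 - 4*m1*m3 - 2*m2**2 + 10*m1**2*m2 - 5*m1**4, matching the formula
```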

Remark 38 These computations illustrate a more general principle in free probability, in that the combinatorics of free probability tend to be the “non-crossing” analogue of the combinatorics of classical probability; compare with Remark 7 of Notes 3.

Remark 39 The {R}-transform allows for efficient computation of the spectral behaviour of sums {X+Y} of free random variables. There is an analogous transform, the {S}-transform, for computing the spectral behaviour (or more precisely, the joint moments) of products {XY} of free random variables; see for instance these notes of Speicher.

The {R}-transform clarifies the privileged role of the semi-circular elements:

Exercise 40 Let {u} be a semi-circular element. Show that {R_{\sqrt{t} u}(s) = t s} for any {t>0}. In particular, the free convolution of {\sqrt{t} u} and {\sqrt{t'} u} is {\sqrt{t+t'} u}.

Exercise 41 From the above exercise, we see that the effect of adding a free copy of {\sqrt{t} u} to a non-commutative random variable {X} is to shift the {R}-transform by {ts}. Explain how this is compatible with the Dyson Brownian motion computations in Notes 4.

It also gives a free analogue of the central limit theorem:

Exercise 42 (Free central limit theorem) Let {X} be a self-adjoint random variable with mean zero and variance one (i.e. {\tau(X)=0} and {\tau(X^2)=1}), and let {X_1, X_2, X_3, \ldots} be free copies of {X}. Let {S_n := (X_1+\ldots+X_n)/\sqrt{n}}. Show that the coefficients of the formal power series {R_{S_n}(s)} converge to those of the identity function {s}. Conclude that {S_n} converges in the sense of moments to a semi-circular element {u}.
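One can watch this convergence numerically, again modelling free copies of a centred sign variable by Haar-conjugated sign matrices (so this also presumes the asymptotic freeness phenomenon of Remark 35). For the Bernoulli law the fourth free cumulant is {C_4(X) = -1}, so one expects {\tau(S_n^4) = 2 - \frac{1}{n}} up to finite-dimension fluctuations. A rough numpy sketch:

```python
import numpy as np

# Free CLT demo: S_n = (X_1 + ... + X_n)/sqrt(n), with the free copies X_i
# modelled by Haar-conjugated copies of a sign matrix X (mean 0, variance 1).
# The fourth moment should approach the semi-circular value tau(u^4) = 2;
# for the Bernoulli law one expects roughly 2 - 1/n.
rng = np.random.default_rng(2)
N = 1000       # matrix dimension (large, to emulate the free model)
n = 10         # number of free summands

def haar_unitary(N):
    Z = (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))     # phase fix for exact Haar measure

X = np.diag(np.concatenate([np.ones(N // 2), -np.ones(N // 2)]))
S = sum(U.conj().T @ X @ U for U in (haar_unitary(N) for _ in range(n)))
S = S / np.sqrt(n)

tau = lambda M: np.trace(M).real / N
print(tau(np.linalg.matrix_power(S, 4)))   # ~ 1.9, tending to 2 as n grows
```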

The free central limit theorem implies the Wigner semi-circular law, at least for the GUE ensemble and in the sense of expectation. Indeed, if {M_n} is an {n \times n} GUE matrix, then the matrices {\frac{1}{\sqrt{n}} M_n} are a.s. uniformly bounded (by the Bai-Yin theorem, Notes 3), and so (after passing to a subsequence, if necessary), they converge in the sense of moments to some limit {u}.

On the other hand, if {M'_n} is an independent copy of {M_n}, then {M_n + M'_n \equiv \sqrt{2} M_n} from the properties of gaussians. Taking limits, we conclude that {u+u' \equiv \sqrt{2} u}, where (by Proposition 33) {u'} is a free copy of {u}. Comparing this with the free central limit theorem (or just the additivity property of {R}-transforms) we see that {u} must have the semi-circular distribution. Thus the semi-circular distribution is the only possible limit point of the {\frac{1}{\sqrt{n}} M_n}, and the Wigner semi-circular law then holds (in expectation, and for GUE). Using concentration of measure, we can upgrade the convergence in expectation to a.s. convergence; using the Lindeberg replacement trick one can replace GUE with arbitrary Wigner matrices with (say) bounded coefficients; and then by using the truncation trick one can remove the boundedness hypothesis. (These latter few steps were also discussed in Notes 4.)