
One of the basic tools in modern combinatorics is the probabilistic method, introduced by Erdos, in which a deterministic solution to a given problem is shown to exist by constructing a random candidate for a solution, and showing that this candidate solves all the requirements of the problem with positive probability. When the problem requires a real-valued statistic ${X}$ to be suitably large or suitably small, the following trivial observation is often employed:

Proposition 1 (Comparison with mean) Let ${X}$ be a random real-valued variable, whose mean (or first moment) ${\mathop{\bf E} X}$ is finite. Then

$\displaystyle X \leq \mathop{\bf E} X$

with positive probability, and

$\displaystyle X \geq \mathop{\bf E} X$

with positive probability.

This proposition is usually applied in conjunction with a computation of the first moment ${\mathop{\bf E} X}$, in which case this version of the probabilistic method becomes an instance of the first moment method. (For comparison with other moment methods, such as the second moment method, exponential moment method, and zeroth moment method, see Chapter 1 of my book with Van Vu. For a general discussion of the probabilistic method, see the book by Alon and Spencer of the same name.)
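As a toy illustration of the first moment method (the graph and sample count below are my own choices): every graph has a cut containing at least half its edges, since a uniformly random cut ${X}$ has ${\mathop{\bf E} X = |E|/2}$, and so ${X \geq \mathop{\bf E} X}$ with positive probability by Proposition 1. A short Python sketch:

```python
import random

random.seed(0)
edges = [(i, j) for i in range(8) for j in range(i + 1, 8)]  # the complete graph K_8
best = 0
for _ in range(200):                    # sample random candidate cuts
    side = [random.random() < 0.5 for _ in range(8)]
    cut = sum(side[i] != side[j] for i, j in edges)
    best = max(best, cut)
# A uniformly random cut has E X = |E|/2, so X >= |E|/2 with positive
# probability; some sampled cut therefore meets the mean.
print(best >= len(edges) / 2)           # True
```

Of course, the probabilistic method guarantees existence without sampling; the loop merely exhibits a witness.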

As a typical example in random matrix theory, if one wanted to understand how small or how large the operator norm ${\|A\|_{op}}$ of a random matrix ${A}$ could be, one might first try to compute the expected operator norm ${\mathop{\bf E} \|A\|_{op}}$ and then apply Proposition 1; see this previous blog post for examples of this strategy (and related strategies, based on comparing ${\|A\|_{op}}$ with more tractable expressions such as the moments ${\hbox{tr} A^k}$). (In this blog post, all matrices are complex-valued.)

Recently, in their proof of the Kadison-Singer conjecture (and also in their earlier paper on Ramanujan graphs), Marcus, Spielman, and Srivastava introduced a striking new variant of the first moment method, suited in particular for controlling the operator norm ${\|A\|_{op}}$ of a Hermitian positive semi-definite matrix ${A}$. Such matrices have non-negative real eigenvalues, and so ${\|A\|_{op}}$ in this case is just the largest eigenvalue ${\lambda_1(A)}$ of ${A}$. Traditionally, one tries to control the eigenvalues through averaged statistics such as moments ${\hbox{tr} A^k = \sum_i \lambda_i(A)^k}$ or Stieltjes transforms ${\hbox{tr} (A-z)^{-1} = \sum_i (\lambda_i(A)-z)^{-1}}$; again, see this previous blog post. Here we use ${z}$ as short-hand for ${zI_d}$, where ${I_d}$ is the ${d \times d}$ identity matrix. Marcus, Spielman, and Srivastava instead rely on the interpretation of the eigenvalues ${\lambda_i(A)}$ of ${A}$ as the roots of the characteristic polynomial ${p_A(z) := \hbox{det}(z-A)}$ of ${A}$, thus

$\displaystyle \|A\|_{op} = \hbox{maxroot}( p_A ) \ \ \ \ \ (1)$

where ${\hbox{maxroot}(p)}$ is the largest real root of a non-zero polynomial ${p}$. (In our applications, we will only ever apply ${\hbox{maxroot}}$ to polynomials that have at least one real root, but for sake of completeness let us set ${\hbox{maxroot}(p)=-\infty}$ if ${p}$ has no real roots.)
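As a quick numerical sanity check of (1) (the random matrix and tolerance below are my own choices), one can compare the operator norm of a positive semi-definite matrix against the largest root of its characteristic polynomial:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T                            # Hermitian positive semi-definite
op_norm = np.linalg.norm(A, 2)         # largest eigenvalue lambda_1(A)
maxroot = max(r.real for r in np.roots(np.poly(A)))  # largest root of p_A
print(abs(op_norm - maxroot) < 1e-6)   # True: identity (1) for PSD matrices
```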

Prior to the work of Marcus, Spielman, and Srivastava, I think it is safe to say that the conventional wisdom in random matrix theory was that the representation (1) of the operator norm ${\|A\|_{op}}$ was not particularly useful, due to the highly non-linear nature of both the characteristic polynomial map ${A \mapsto p_A}$ and the maximum root map ${p \mapsto \hbox{maxroot}(p)}$. (Although, as pointed out to me by Adam Marcus, some related ideas have occurred in graph theory rather than random matrix theory, for instance in the theory of the matching polynomial of a graph.) For instance, a fact as basic as the triangle inequality ${\|A+B\|_{op} \leq \|A\|_{op} + \|B\|_{op}}$ is extremely difficult to establish through (1). Nevertheless, it turns out that for certain special types of random matrices ${A}$ (particularly those in which a typical instance ${A}$ of this ensemble has a simple relationship to “adjacent” matrices in this ensemble), the polynomials ${p_A}$ enjoy an extremely rich structure (in particular, they lie in families of real stable polynomials, and hence enjoy good combinatorial interlacing properties) that can be surprisingly useful. In particular, Marcus, Spielman, and Srivastava established the following nonlinear variant of Proposition 1:

Proposition 2 (Comparison with mean) Let ${m,d \geq 1}$. Let ${A}$ be a random matrix, which is the sum ${A = \sum_{i=1}^m A_i}$ of independent Hermitian rank one ${d \times d}$ matrices ${A_i}$, each taking a finite number of values. Then

$\displaystyle \hbox{maxroot}(p_A) \leq \hbox{maxroot}( \mathop{\bf E} p_A )$

with positive probability, and

$\displaystyle \hbox{maxroot}(p_A) \geq \hbox{maxroot}( \mathop{\bf E} p_A )$

with positive probability.
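Here is a small numerical illustration of Proposition 2, in a toy ensemble of my own choosing: each ${A_i}$ is ${vv^T}$ for one of two equally likely vectors, all ${2^m}$ outcomes are enumerated, and ${\mathop{\bf E} p_A}$ is obtained by averaging the coefficients of the (monic) characteristic polynomials:

```python
from itertools import product
import numpy as np

vals = [np.array([1.0, 0.3]), np.array([-0.5, 1.2])]   # two equally likely values of v_i
m = 3
outcomes = [np.poly(sum(np.outer(v, v) for v in choice))   # coefficients of p_A
            for choice in product(vals, repeat=m)]
mean_poly = np.mean(outcomes, axis=0)                      # coefficients of E p_A
maxroot = lambda c: max(r.real for r in np.roots(c) if abs(r.imag) < 1e-6)
sample_roots = [maxroot(c) for c in outcomes]
mu = maxroot(mean_poly)
# Proposition 2: maxroot(p_A) falls on both sides of maxroot(E p_A) with
# positive probability, so mu must lie between the extreme sampled roots.
print(min(sample_roots) <= mu <= max(sample_roots))        # True
```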

We prove this proposition below the fold. The hypothesis that each ${A_i}$ only takes finitely many values is technical and can likely be relaxed substantially, but we will not need to do so here. Despite the superficial similarity with Proposition 1, the proof of Proposition 2 is quite nonlinear; in particular, one needs the interlacing properties of real stable polynomials to proceed. Another key ingredient in the proof is the observation that while the determinant ${\hbox{det}(A)}$ of a matrix ${A}$ generally behaves in a nonlinear fashion on the underlying matrix ${A}$, it becomes (affine-)linear when one considers rank one perturbations, and so ${p_A}$ depends in an affine-multilinear fashion on the ${A_1,\ldots,A_m}$. More precisely, we have the following deterministic formula, also proven below the fold:

Proposition 3 (Deterministic multilinearisation formula) Let ${A}$ be the sum of deterministic rank one ${d \times d}$ matrices ${A_1,\ldots,A_m}$. Then we have

$\displaystyle p_A(z) = \mu[A_1,\ldots,A_m](z) \ \ \ \ \ (2)$

for all ${z \in {\bf C}}$, where the mixed characteristic polynomial ${\mu[A_1,\ldots,A_m](z)}$ of any ${d \times d}$ matrices ${A_1,\ldots,A_m}$ (not necessarily rank one) is given by the formula

$\displaystyle \mu[A_1,\ldots,A_m](z) \ \ \ \ \ (3)$

$\displaystyle = (\prod_{i=1}^m (1 - \frac{\partial}{\partial z_i})) \hbox{det}( z + \sum_{i=1}^m z_i A_i ) |_{z_1=\ldots=z_m=0}.$
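Proposition 3 can be checked symbolically in a small example (the rank one matrices below are my own choice), implementing (3) directly with sympy:

```python
import sympy as sp

def mixed_char_poly(mats, z):
    """Mixed characteristic polynomial mu[A_1,...,A_m](z), implementing (3)."""
    m, d = len(mats), mats[0].rows
    ts = sp.symbols(f't1:{m + 1}')
    expr = (z * sp.eye(d) + sum((t * A for t, A in zip(ts, mats)),
                                sp.zeros(d))).det()
    for t in ts:                          # apply prod_i (1 - d/dz_i)
        expr = expr - sp.diff(expr, t)
    return sp.expand(expr.subs({t: 0 for t in ts}))

z = sp.symbols('z')
v1, v2 = sp.Matrix([1, 2]), sp.Matrix([3, -1])
A1, A2 = v1 * v1.T, v2 * v2.T             # rank one (real) Hermitian matrices
A = A1 + A2
p_A = sp.expand((z * sp.eye(2) - A).det())
print(mixed_char_poly([A1, A2], z) == p_A)   # True, as (2) asserts
```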

Among other things, this formula gives a useful representation of the mean characteristic polynomial ${\mathop{\bf E} p_A}$:

Corollary 4 (Random multilinearisation formula) Let ${A}$ be the sum of jointly independent rank one ${d \times d}$ matrices ${A_1,\ldots,A_m}$. Then we have

$\displaystyle \mathop{\bf E} p_A(z) = \mu[ \mathop{\bf E} A_1, \ldots, \mathop{\bf E} A_m ](z) \ \ \ \ \ (4)$

for all ${z \in {\bf C}}$.

Proof: For fixed ${z}$, the expression ${\hbox{det}( z + \sum_{i=1}^m z_i A_i )}$ is a polynomial combination of the ${z_i A_i}$, while the differential operator ${(\prod_{i=1}^m (1 - \frac{\partial}{\partial z_i}))}$ is a linear combination of differential operators ${\frac{\partial^j}{\partial z_{i_1} \ldots \partial z_{i_j}}}$ for ${1 \leq i_1 < \ldots < i_j \leq m}$. As a consequence, we may expand (3) as a linear combination of terms, each of which is a multilinear combination of ${A_{i_1},\ldots,A_{i_j}}$ for some ${1 \leq i_1 < \ldots < i_j \leq m}$. Taking expectations of both sides of (2) and using the joint independence of the ${A_i}$, we obtain the claim. $\Box$

In view of Proposition 2, we can now hope to control the operator norm ${\|A\|_{op}}$ of certain special types of random matrices ${A}$ (and specifically, the sum of independent Hermitian positive semi-definite rank one matrices) by first controlling the mean ${\mathop{\bf E} p_A}$ of the random characteristic polynomial ${p_A}$. Pursuing this philosophy, Marcus, Spielman, and Srivastava establish the following result, which they then use to prove the Kadison-Singer conjecture:

Theorem 5 (Marcus-Spielman-Srivastava theorem) Let ${m,d \geq 1}$. Let ${v_1,\ldots,v_m \in {\bf C}^d}$ be jointly independent random vectors in ${{\bf C}^d}$, with each ${v_i}$ taking a finite number of values. Suppose that we have the normalisation

$\displaystyle \mathop{\bf E} \sum_{i=1}^m v_i v_i^* = 1$

where we are using the convention that ${1}$ is the ${d \times d}$ identity matrix ${I_d}$ whenever necessary. Suppose also that we have the smallness condition

$\displaystyle \mathop{\bf E} \|v_i\|^2 \leq \epsilon$

for some ${\epsilon>0}$ and all ${i=1,\ldots,m}$. Then one has

$\displaystyle \| \sum_{i=1}^m v_i v_i^* \|_{op} \leq (1+\sqrt{\epsilon})^2 \ \ \ \ \ (5)$

with positive probability.

Note that the upper bound in (5) must be at least ${1}$ (by taking ${v_i}$ to be deterministic) and also must be at least ${\epsilon}$ (by taking the ${v_i}$ to always have magnitude at least ${\sqrt{\epsilon}}$). Thus the bound in (5) is asymptotically tight both in the regime ${\epsilon\rightarrow 0}$ and in the regime ${\epsilon \rightarrow \infty}$; the former regime will be particularly useful for applications to Kadison-Singer. It should also be noted that if one uses more traditional random matrix theory methods (based on tools such as Proposition 1, as well as more sophisticated variants of these tools, such as the concentration of measure results of Rudelson and Ahlswede-Winter), one obtains a bound of ${\| \sum_{i=1}^m v_i v_i^* \|_{op} \ll_\epsilon \log d}$ with high probability, which is insufficient for the application to the Kadison-Singer problem; see this article of Tropp. Thus, Theorem 5 obtains a sharper bound, at the cost of trading in “high probability” for “positive probability”.
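One can probe Theorem 5 by Monte Carlo in a toy ensemble (the dimensions, sample count, and distribution below are my own choices): each ${v_i}$ is ${\sqrt{d/m}}$ times a uniformly random standard basis vector, so that ${\mathop{\bf E} \sum_i v_i v_i^* = I_d}$ and one may take ${\epsilon = d/m}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 30
eps = d / m                             # E ||v_i||^2 = d/m for every i
bound = (1 + np.sqrt(eps))**2

def sample_norm():
    # v_i = sqrt(d/m) e_j with j uniform, so E sum_i v_i v_i^* = I_d
    A = np.zeros((d, d))
    for j in rng.integers(0, d, size=m):
        A[j, j] += d / m                # add the rank one matrix v_i v_i^*
    return np.linalg.norm(A, 2)

norms = [sample_norm() for _ in range(200)]
print(min(norms) <= bound)              # True: the event in (5) occurs
```

Of course the theorem asserts positive probability, not high probability, so only the minimum over samples is compared against the bound.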

In the paper of Marcus, Spielman and Srivastava, Theorem 5 is used to deduce a conjecture ${KS_2}$ of Weaver, which was already known to imply the Kadison-Singer conjecture; actually, a slight modification of their argument gives the paving conjecture of Kadison and Singer, from which the original Kadison-Singer conjecture may be readily deduced. We give these implications below the fold. (See also this survey article for some background on the Kadison-Singer problem.)

Let us now summarise how Theorem 5 is proven. In the spirit of semi-definite programming, we rephrase the above theorem in terms of the rank one Hermitian positive semi-definite matrices ${A_i := v_iv_i^*}$:

Theorem 6 (Marcus-Spielman-Srivastava theorem again) Let ${A_1,\ldots,A_m}$ be jointly independent random rank one Hermitian positive semi-definite ${d \times d}$ matrices such that the sum ${A :=\sum_{i=1}^m A_i}$ has mean

$\displaystyle \mathop{\bf E} A = I_d$

and such that

$\displaystyle \mathop{\bf E} \hbox{tr} A_i \leq \epsilon$

for some ${\epsilon>0}$ and all ${i=1,\ldots,m}$. Then one has

$\displaystyle \| A \|_{op} \leq (1+\sqrt{\epsilon})^2$

with positive probability.

In view of (1) and Proposition 2, this theorem follows from the following control on the mean characteristic polynomial:

Theorem 7 (Control of mean characteristic polynomial) Let ${A_1,\ldots,A_m}$ be jointly independent random rank one Hermitian positive semi-definite ${d \times d}$ matrices such that the sum ${A :=\sum_{i=1}^m A_i}$ has mean

$\displaystyle \mathop{\bf E} A = 1$

and such that

$\displaystyle \mathop{\bf E} \hbox{tr} A_i \leq \epsilon$

for some ${\epsilon>0}$ and all ${i=1,\ldots,m}$. Then one has

$\displaystyle \hbox{maxroot}(\mathop{\bf E} p_A) \leq (1 +\sqrt{\epsilon})^2.$

This result is proven using the multilinearisation formula (Corollary 4) and some convexity properties of real stable polynomials; we give the proof below the fold.

Thanks to Adam Marcus, Assaf Naor and Sorin Popa for many useful explanations on various aspects of the Kadison-Singer problem.

A finite group ${G=(G,\cdot)}$ is said to be a Frobenius group if there is a non-trivial subgroup ${H}$ of ${G}$ (known as the Frobenius complement of ${G}$) such that the conjugates ${gHg^{-1}}$ of ${H}$ are “disjoint as possible” in the sense that ${H \cap gHg^{-1} = \{1\}}$ whenever ${g \not \in H}$. This gives a decomposition

$\displaystyle G = \bigcup_{gH \in G/H} (gHg^{-1} \backslash \{1\}) \cup K \ \ \ \ \ (1)$

where the Frobenius kernel ${K}$ of ${G}$ is defined as the identity element ${1}$ together with all the non-identity elements that are not conjugate to any element of ${H}$. Taking cardinalities, we conclude that

$\displaystyle |G| = \frac{|G|}{|H|} (|H| - 1) + |K|$

and hence

$\displaystyle |H| |K| = |G|. \ \ \ \ \ (2)$

A remarkable theorem of Frobenius gives an unexpected amount of structure on ${K}$ and hence on ${G}$:

Theorem 1 (Frobenius’ theorem) Let ${G}$ be a Frobenius group with Frobenius complement ${H}$ and Frobenius kernel ${K}$. Then ${K}$ is a normal subgroup of ${G}$, and hence (by (2) and the disjointness of ${H}$ and ${K}$ outside the identity) ${G}$ is the semidirect product ${K \rtimes H}$ of ${K}$ and ${H}$.
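For a concrete check of the definitions and of Theorem 1 (the choice of group here is mine), one can take ${G}$ to be the affine group ${x \mapsto ax+b}$ over ${{\bf F}_5}$ and ${H}$ the stabiliser of ${0}$; the Frobenius kernel is then the group of translations:

```python
p = 5
G = [(a, b) for a in range(1, p) for b in range(p)]   # x -> a x + b mod p

def mul(g, h):                          # composition (g o h)(x) = g(h(x))
    (a, b), (c, d) = g, h
    return (a * c % p, (a * d + b) % p)

def inv(g):
    a, b = g
    ai = pow(a, -1, p)
    return (ai, (-ai * b) % p)

e = (1, 0)
H = [(a, 0) for a in range(1, p)]       # stabiliser of 0, the Frobenius complement
# Frobenius condition: H ∩ gHg^{-1} = {1} whenever g lies outside H
assert all(set(H) & {mul(mul(g, h), inv(g)) for h in H} == {e}
           for g in G if g not in H)
# Frobenius kernel: 1 together with the elements conjugate to no nontrivial element of H
conj_of_H = {mul(mul(g, h), inv(g)) for g in G for h in H if h != e}
K = [g for g in G if g == e or g not in conj_of_H]
assert len(H) * len(K) == len(G)        # identity (2)
# Theorem 1: K is a normal subgroup (here, the translations x -> x + b)
assert all(mul(mul(g, k), inv(g)) in K for g in G for k in K)
print(sorted(K))                        # [(1, 0), (1, 1), (1, 2), (1, 3), (1, 4)]
```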

I discussed Frobenius’ theorem and its proof in this recent blog post. This proof uses the theory of characters on a finite group ${G}$, in particular relying on the fact that a character on a subgroup ${H}$ can induce a character on ${G}$, which can then be decomposed into irreducible characters with natural number coefficients. Remarkably, even though a century has passed since Frobenius’ original argument, there is no proof known of this theorem which avoids character theory entirely; there are elementary proofs known when the complement ${H}$ has even order or when ${H}$ is solvable (we review both of these cases below the fold), which by the Feit-Thompson theorem does cover all the cases, but the proof of the Feit-Thompson theorem involves plenty of character theory (and also relies on Theorem 1). (The answers to this MathOverflow question give a good overview of the current state of affairs.)

I have been playing around recently with the problem of finding a character-free proof of Frobenius’ theorem. I didn’t succeed in obtaining a completely elementary proof, but I did find an argument which replaces character theory (which can be viewed as coming from the representation theory of the non-commutative group algebra ${{\bf C} G \equiv L^2(G)}$) with the Fourier analysis of class functions (i.e. the representation theory of the centre ${Z({\bf C} G) \equiv L^2(G)^G}$ of the group algebra), thus replacing non-commutative representation theory by commutative representation theory. This is not a particularly radical departure from the existing proofs of Frobenius’ theorem, but it did seem to be a new proof which was technically “character-free” (even if it was not all that far from character-based in spirit), so I thought I would record it here.

The main ideas are as follows. The space ${L^2(G)^G}$ of class functions can be viewed as a commutative algebra with respect to the convolution operation ${*}$; as the regular representation is unitary and faithful, this algebra contains no nilpotent elements. As such, (Gelfand-style) Fourier analysis suggests that one can analyse this algebra through the idempotents: class functions ${\phi}$ such that ${\phi*\phi = \phi}$. In terms of characters, idempotents are nothing more than sums of the form ${\sum_{\chi \in \Sigma} \chi(1) \chi}$ for various collections ${\Sigma}$ of characters, but we can perform a fair amount of analysis on idempotents directly without recourse to characters. In particular, it turns out that idempotents enjoy some important integrality properties that can be established without invoking characters: for instance, by taking traces one can check that ${\phi(1)}$ is a natural number, and more generally we will show that ${{\bf E}_{(a,b) \in S} {\bf E}_{x \in G} \phi( a x b^{-1} x^{-1} )}$ is a natural number whenever ${S}$ is a subgroup of ${G \times G}$ (see Corollary 4 below). For instance, the quantity

$\displaystyle \hbox{rank}(\phi) := {\bf E}_{a \in G} {\bf E}_{x \in G} \phi(a xa^{-1} x^{-1})$

is a natural number which we will call the rank of ${\phi}$ (as it is also the linear rank of the transformation ${f \mapsto f*\phi}$ on ${L^2(G)}$).
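These notions can be made concrete in a small example of my own choosing: for ${G = S_3}$ and ${K = A_3}$, the class function ${\phi = \frac{|G|}{|K|} 1_K}$ is an idempotent under the normalised convolution ${f*h(x) = {\bf E}_y f(y) h(y^{-1}x)}$, and its rank evaluates to the natural number ${2}$:

```python
from itertools import permutations

G = list(permutations(range(3)))        # S_3
e = (0, 1, 2)

def mul(g, h):                          # (g h)(i) = g(h(i))
    return tuple(g[h[i]] for i in range(3))

def inv(g):
    out = [0] * 3
    for i, gi in enumerate(g):
        out[gi] = i
    return tuple(out)

def sign(g):                            # parity of the permutation
    return (-1) ** sum(g[i] > g[j] for i in range(3) for j in range(i + 1, 3))

A3 = [g for g in G if sign(g) == 1]     # the kernel K = A_3
phi = {g: (len(G) / len(A3)) * (g in A3) for g in G}   # (|G|/|K|) 1_K

def conv(f, h):                         # (f*h)(x) = E_y f(y) h(y^{-1} x)
    return {x: sum(f[y] * h[mul(inv(y), x)] for y in G) / len(G) for x in G}

assert conv(phi, phi) == phi            # phi is an idempotent
rank = sum(phi[mul(mul(a, x), mul(inv(a), inv(x)))]    # E_a E_x phi(a x a^{-1} x^{-1})
           for a in G for x in G) / len(G) ** 2
print(rank)                             # 2.0, a natural number as claimed
```

(In character terms, this ${\phi}$ is the sum of the trivial and sign characters, so its rank is ${1^2+1^2=2}$, consistent with the computation.)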

In the case that ${G}$ is a Frobenius group with kernel ${K}$, the above integrality properties can be used after some elementary manipulations to establish that for any idempotent ${\phi}$, the quantity

$\displaystyle \frac{1}{|G|} \sum_{a \in K} {\bf E}_{x \in G} \phi( axa^{-1}x^{-1} ) - \frac{1}{|G| |K|} \sum_{a,b \in K} \phi(ab^{-1}) \ \ \ \ \ (3)$

is an integer. On the other hand, one can also show by elementary means that this quantity lies between ${0}$ and ${\hbox{rank}(\phi)}$. These two facts are not strong enough on their own to impose much further structure on ${\phi}$, unless one restricts attention to minimal idempotents ${\phi}$. In this case spectral theory (or Gelfand theory, or the fundamental theorem of algebra) tells us that ${\phi}$ has rank one, and then the integrality gap comes into play and forces the quantity (3) to always be either zero or one. This can be used to imply that the convolution action of every minimal idempotent ${\phi}$ either preserves ${\frac{|G|}{|K|} 1_K}$ or annihilates it, which makes ${\frac{|G|}{|K|} 1_K}$ itself an idempotent, which makes ${K}$ normal.

Two weeks ago I was at Oberwolfach, for the Arbeitsgemeinschaft in Ergodic Theory and Combinatorial Number Theory that I was one of the organisers for. At this workshop, I learned the details of a very nice recent convergence result of Miguel Walsh (who, incidentally, is an informal grandstudent of mine, as his advisor, Roman Sasyk, was my informal student), which considerably strengthens and generalises a number of previous convergence results in ergodic theory (including one of my own), with a remarkably simple proof. Walsh’s argument is phrased in a finitary language (somewhat similar, in fact, to the approach used in my paper mentioned previously), and (among other things) relies on the concept of metastability of sequences, a variant of the notion of convergence which is useful in situations in which one does not expect a uniform convergence rate; see this previous blog post for some discussion of metastability. When interpreted in a finitary setting, this concept requires a fair amount of “epsilon management” to manipulate; also, Walsh’s argument uses some other epsilon-intensive finitary arguments, such as a decomposition lemma of Gowers based on the Hahn-Banach theorem. As such, I was tempted to try to rewrite Walsh’s argument in the language of nonstandard analysis to see the extent to which these sorts of issues could be managed. As it turns out, the argument gets cleaned up rather nicely, with the notion of metastability being replaced with the simpler notion of external Cauchy convergence (which we will define below the fold).

Let’s first state Walsh’s theorem. This theorem is a norm convergence theorem in ergodic theory, and can be viewed as a substantial generalisation of one of the most fundamental theorems of this type, namely the mean ergodic theorem:

Theorem 1 (Mean ergodic theorem) Let ${(X,\mu,T)}$ be a measure-preserving system (a probability space ${(X,\mu)}$ with an invertible measure-preserving transformation ${T}$). Then for any ${f \in L^2(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N T^n f}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${T^n f(x) := f(T^{-n} x)}$.

In this post, all functions in ${L^2(X,\mu)}$ and similar spaces will be taken to be real instead of complex-valued for simplicity, though the extension to the complex setting is routine.

Actually, we have a precise description of the limit of these averages, namely the orthogonal projection of ${f}$ to the ${T}$-invariant factors. (See for instance my lecture notes on this theorem.) While this theorem ostensibly involves measure theory, it can be abstracted to the more general setting of unitary operators on a Hilbert space:

Theorem 2 (von Neumann mean ergodic theorem) Let ${H}$ be a Hilbert space, and let ${U: H \rightarrow H}$ be a unitary operator on ${H}$. Then for any ${f \in H}$, the averages ${\frac{1}{N} \sum_{n=1}^N U^n f}$ converge strongly in ${H}$ as ${N \rightarrow \infty}$.

Again, see my lecture notes (or just about any text in ergodic theory) for a proof.
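A quick numerical illustration of Theorem 2 (the unitary and vector here are my own choice): take ${H = {\bf R}^3}$ with ${U}$ fixing a line and rotating the orthogonal plane by an irrational multiple of ${\pi}$; the ergodic averages then converge to the projection of ${f}$ onto the fixed line:

```python
import numpy as np

theta = np.sqrt(2) * np.pi              # irrational rotation angle
c, s = np.cos(theta), np.sin(theta)
U = np.array([[1, 0, 0],
              [0, c, -s],
              [0, s,  c]])              # unitary: fixed line plus rotated plane
f = np.array([1.0, 1.0, 1.0])
P_f = np.array([1.0, 0.0, 0.0])         # projection of f onto the U-invariant vectors

def ergodic_avg(N):
    g, avg = f.copy(), np.zeros(3)
    for _ in range(N):
        g = U @ g                       # g = U^n f
        avg += g
    return avg / N

print(np.linalg.norm(ergodic_avg(10**4) - P_f))   # tends to 0 as N grows
```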

Now we turn to Walsh’s theorem.

Theorem 3 (Walsh’s convergence theorem) Let ${(X,\mu)}$ be a measure space with a measure-preserving action of a nilpotent group ${G}$. Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences in ${G}$ (i.e. each ${g_i}$ takes the form ${g_i(n) = a_{i,1}^{p_{i,1}(n)} \ldots a_{i,j}^{p_{i,j}(n)}}$ for some ${a_{i,1},\ldots,a_{i,j} \in G}$ and polynomials ${p_{i,1},\ldots,p_{i,j}: {\bf Z} \rightarrow {\bf Z}}$). Then for any ${f_1,\ldots,f_k \in L^\infty(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${g(n) f(x) := f(g(n)^{-1} x)}$.

It turns out that this theorem can also be abstracted to some extent, although due to the multiplication in the summand ${(g_1(n) f_1) \ldots (g_k(n) f_k)}$, one cannot work purely with Hilbert spaces as in the von Neumann mean ergodic theorem, but must also work with something like the Banach algebra ${L^\infty(X,\mu)}$. There are a number of ways to formulate this abstraction (which will be of some minor convenience to us, as it will allow us to reduce the need to invoke the nonstandard measure theory of Loeb, discussed for instance in this blog post); we will use the notion of a (real) commutative probability space ${({\mathcal A},\tau)}$, which for us will be a commutative unital algebra ${{\mathcal A}}$ over the reals together with a linear functional ${\tau: {\mathcal A} \rightarrow {\bf R}}$ which maps ${1}$ to ${1}$ and obeys the non-negativity axiom ${\tau(f^2) \ge 0}$ for all ${f}$. The key example to keep in mind here is the algebra ${{\mathcal A} = L^\infty(X,\mu)}$ of essentially bounded real-valued measurable functions with the supremum norm, and with the trace ${\tau(f) := \int_X f\ d\mu}$. We will also assume in our definition of commutative probability spaces that all elements ${f}$ of ${{\mathcal A}}$ are bounded in the sense that the spectral radius ${\rho(f) := \lim_{k \rightarrow \infty} \tau(f^{2k})^{1/2k}}$ is finite. (In the concrete case of ${L^\infty(X,\mu)}$, the spectral radius is just the ${L^\infty}$ norm.)

Given a commutative probability space, we can form an inner product ${\langle, \rangle_{L^2(\tau)}}$ on it by the formula

$\displaystyle \langle f, g \rangle_{L^2(\tau)} := \tau(fg).$

This is a positive semi-definite form, and gives a (possibly degenerate) inner product structure on ${{\mathcal A}}$. We could complete this structure into a Hilbert space ${L^2(\tau)}$ (after quotienting out the elements of zero norm), but we will not do so here, instead just viewing ${L^2(\tau)}$ as providing a semi-metric on ${{\mathcal A}}$. For future reference we record the inequalities

$\displaystyle \rho(fg) \leq \rho(f) \rho(g)$

$\displaystyle \rho(f+g) \leq \rho(f) + \rho(g)$

$\displaystyle \| fg\|_{L^2(\tau)} \leq \|f\|_{L^2(\tau)} \rho(g)$

for any ${f,g}$, which we will use in the sequel without further comment; see e.g. these previous blog notes for proofs. (Actually, for the purposes of proving Theorem 3, one can specialise to the ${L^\infty(X,\mu)}$ case (and ultraproducts thereof), in which case these inequalities are just the triangle and Hölder inequalities.)
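In the concrete case where ${X}$ is a finite set with uniform measure, one can check numerically that the limit defining the spectral radius recovers the ${L^\infty}$ norm (the sample values and the truncation ${k}$ below are my own choices):

```python
import numpy as np

# X = {1,...,4} with uniform measure; tau(f) = E f
f = np.array([0.3, -1.7, 0.8, 1.2])

def rho(f, k=200):                      # tau(f^{2k})^{1/2k} for large but finite k
    return np.mean(f ** (2 * k)) ** (1 / (2 * k))

print(rho(f), np.max(np.abs(f)))        # both approximately 1.7
```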

The abstract version of Theorem 3 is then

Theorem 4 (Walsh’s theorem, abstract version) Let ${({\mathcal A},\tau)}$ be a commutative probability space, and let ${G}$ be a nilpotent group acting on ${{\mathcal A}}$ by isomorphisms (preserving the algebra, conjugation, and trace structure, and thus also preserving the spectral radius and ${L^2(\tau)}$ norm). Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences. Then for any ${f_1,\ldots,f_k \in {\mathcal A}}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ form a Cauchy sequence in ${L^2(\tau)}$ (semi-)norm as ${N \rightarrow \infty}$.

It is easy to see that this theorem generalises Theorem 3. Conversely, one can use the commutative Gelfand-Naimark theorem to deduce Theorem 4 from Theorem 3, although we will not need this implication. Note how we are abandoning all attempts to discern what the limit of the sequence actually is, instead contenting ourselves with demonstrating that it is merely a Cauchy sequence. With this phrasing, it is tempting to ask whether there is any analogue of Walsh’s theorem for noncommutative probability spaces, but unfortunately the answer to that question is negative for all but the simplest of averages, as was worked out in this paper of Austin, Eisner, and myself.

Our proof of Theorem 4 will proceed as follows. Firstly, in order to avoid the epsilon management alluded to earlier, we will take an ultraproduct to rephrase the theorem in the language of nonstandard analysis; for reasons that will be clearer later, we will also convert the convergence problem to a problem of obtaining metastability (external Cauchy convergence). Then, we observe that (the nonstandard counterpart of) the expression ${\|\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)\|_{L^2(\tau)}^2}$ can be viewed as the inner product of (say) ${f_k}$ with a certain type of expression, which we call a dual function. By performing an orthogonal projection to the span of the dual functions, we can split ${f_k}$ into the sum of an expression orthogonal to all dual functions (the “pseudorandom” component), and a function that can be well approximated by finite linear combinations of dual functions (the “structured” component). The contribution of the pseudorandom component is asymptotically negligible, so we can reduce to consideration of the structured component. But by a little bit of rearrangement, this can be viewed as an average of expressions similar to the initial average ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$, except with the polynomials ${g_1,\ldots,g_k}$ replaced by a “lower complexity” set of such polynomials, which can be greater in number, but which have slightly lower degrees in some sense. One can iterate this (using “PET induction”) until all the polynomials become trivial, at which point the claim follows.

One of the basic problems in the field of operator algebras is to develop a functional calculus for either a single operator ${A}$, or a collection ${A_1, A_2, \ldots, A_k}$ of operators. These operators could in principle act on any function space, but typically one either considers complex matrices (which act on a complex finite dimensional space), or operators (either bounded or unbounded) on a complex Hilbert space. (One can of course also obtain analogous results for real operators, but we will work throughout with complex operators in this post.)

Roughly speaking, a functional calculus is a way to assign an operator ${F(A)}$ or ${F(A_1,\ldots,A_k)}$ to any function ${F}$ in a suitable function space, which is linear over the complex numbers, preserves the scalars (i.e. ${c(A) = c}$ when ${c \in {\bf C}}$), and should be either an exact or approximate homomorphism in the sense that

$\displaystyle FG(A_1,\ldots,A_k) = F(A_1,\ldots,A_k) G(A_1,\ldots,A_k), \ \ \ \ \ (1)$

should hold either exactly or approximately. In the case when the ${A_i}$ are self-adjoint operators acting on a Hilbert space (or Hermitian matrices), one often also desires the identity

$\displaystyle \overline{F}(A_1,\ldots,A_k) = F(A_1,\ldots,A_k)^* \ \ \ \ \ (2)$

to also hold either exactly or approximately. (Note that one cannot reasonably expect (1) and (2) to hold exactly for all ${F,G}$ if the ${A_1,\ldots,A_k}$ and their adjoints ${A_1^*,\ldots,A_k^*}$ do not commute with each other, so in those cases one has to be willing to allow some error terms in the above wish list of properties of the calculus.) Ideally, one should also be able to relate the operator norm of ${F(A)}$ or ${F(A_1,\ldots,A_k)}$ with something like the uniform norm on ${F}$. In principle, the existence of a good functional calculus allows one to manipulate operators as if they were scalars (or at least approximately as if they were scalars), which is very helpful for a number of applications, such as partial differential equations, spectral theory, noncommutative probability, and semiclassical mechanics. A functional calculus for multiple operators ${A_1,\ldots,A_k}$ can be particularly valuable as it allows one to treat ${A_1,\ldots,A_k}$ as being exact or approximate scalars simultaneously. For instance, if one is trying to solve a linear differential equation that can (formally at least) be expressed in the form

$\displaystyle F(X,D) u = f$

for some data ${f}$, unknown function ${u}$, some differential operators ${X,D}$, and some nice function ${F}$, then if one’s functional calculus is good enough (and ${F}$ is suitably “elliptic” in the sense that it does not vanish or otherwise degenerate too often), one should be able to solve this equation either exactly or approximately by the formula

$\displaystyle u = F^{-1}(X,D) f,$

which is of course how one would solve this equation if one pretended that the operators ${X,D}$ were in fact scalars. Formalising this calculus rigorously leads to the theory of pseudodifferential operators, which allows one to (approximately) solve or at least simplify a much wider range of differential equations than what one can achieve with more elementary algebraic transformations (e.g. integrating factors, change of variables, variation of parameters, etc.). In quantum mechanics, a functional calculus that allows one to treat operators as if they were approximately scalar can be used to rigorously justify the correspondence principle in physics, namely that the predictions of quantum mechanics approximate that of classical mechanics in the semiclassical limit ${\hbar \rightarrow 0}$.
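For commuting operators this recipe is classical. A minimal sketch (the grid size and data are my own choices) solves ${(1 - \frac{d^2}{dx^2}) u = f}$ on a periodic grid by applying ${F^{-1}}$ as the Fourier multiplier ${1/(1+\xi^2)}$; this is of course only the commuting constant-coefficient case, not the full pseudodifferential calculus:

```python
import numpy as np

# Solve (1 - d^2/dx^2) u = f on a periodic grid: on the Fourier side the
# operator F(X, D) acts as multiplication by 1 + xi^2, so F^{-1}(X, D) is
# division by 1 + xi^2.
n, L = 256, 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
xi = 2 * np.pi * np.fft.fftfreq(n, d=L / n)      # d/dx acts as multiplication by i*xi
f = np.exp(np.cos(x))                            # smooth periodic data
u = np.fft.ifft(np.fft.fft(f) / (1 + xi**2)).real
# check: applying 1 - d^2/dx^2 spectrally to u recovers f
u_xx = np.fft.ifft(-(xi**2) * np.fft.fft(u)).real
print(np.allclose(u - u_xx, f))                  # True
```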

There is no universal functional calculus that works in all situations; the strongest functional calculi, which are close to being exact *-homomorphisms on a very large class of functions, tend to only work under very restrictive hypotheses on ${A}$ or ${A_1,\ldots,A_k}$ (in particular, when ${k > 1}$, one needs the ${A_1,\ldots,A_k}$ to commute either exactly, or very close to exactly), while there are weaker functional calculi which have fewer nice properties and only work for a very small class of functions, but can be applied to quite general operators ${A}$ or ${A_1,\ldots,A_k}$. In some cases the functional calculus is only formal, in the sense that ${f(A)}$ or ${f(A_1,\ldots,A_k)}$ has to be interpreted as an infinite formal series that does not converge in a traditional sense. Also, when one wishes to select a functional calculus on non-commuting operators ${A_1,\ldots,A_k}$, there is a certain amount of non-uniqueness: one generally has a number of slightly different functional calculi to choose from, which generally have the same properties but differ in some minor technical details (particularly with regards to the behaviour of “lower order” components of the calculus). This is similar to how one has a variety of slightly different coordinate systems available to parameterise a Riemannian manifold or Lie group. This is in contrast to the ${k=1}$ case when the underlying operator ${A = A_1}$ is (essentially) normal (so that ${A}$ commutes with ${A^*}$); in this special case (which includes the important subcases when ${A}$ is unitary or (essentially) self-adjoint), spectral theory gives us a canonical and very powerful functional calculus which can be used without further modification in applications.

Despite this lack of uniqueness, there is one standard choice for a functional calculus available for general operators ${A_1,\ldots,A_k}$, namely the Weyl functional calculus; it is analogous in some ways to normal coordinates for Riemannian manifolds, or exponential coordinates of the first kind for Lie groups, in that it treats lower order terms in a reasonably nice fashion. (But it is important to keep in mind that, like its analogues in Riemannian geometry or Lie theory, there will be some instances in which the Weyl calculus is not the optimal calculus to use for the application at hand.)

I decided to write some notes on the Weyl functional calculus (also known as Weyl quantisation), and to sketch the applications of this calculus to the theory of pseudodifferential operators. They are mostly for my own benefit (so that I won’t have to redo these particular calculations again), but perhaps they will also be of interest to some readers here. (Of course, this material is also covered in many other places, e.g. Folland’s “Harmonic Analysis in Phase Space“.)

Let ${A, B}$ be two Hermitian ${n \times n}$ matrices. When ${A}$ and ${B}$ commute, we have the identity

$\displaystyle e^{A+B} = e^A e^B.$

When ${A}$ and ${B}$ do not commute, the situation is more complicated; we have the Baker-Campbell-Hausdorff formula

$\displaystyle e^{A+B} = e^A e^B e^{-\frac{1}{2}[A,B]} \ldots$

where the infinite product here is explicit but very messy. On the other hand, taking determinants we still have the identity

$\displaystyle \hbox{det}(e^{A+B}) = \hbox{det}(e^A e^B).$

Recently I learned (from Emmanuel Candes, who in turn learned it from David Gross) that there is another very nice relationship between ${e^{A+B}}$ and ${e^A e^B}$, namely the Golden-Thompson inequality

$\displaystyle \hbox{tr}(e^{A+B}) \leq \hbox{tr}(e^A e^B). \ \ \ \ \ (1)$

The remarkable thing about this inequality is that no commutativity hypotheses whatsoever on the matrices ${A, B}$ are required. Note that the right-hand side can be rearranged using the cyclic property of trace as ${\hbox{tr}( e^{B/2} e^A e^{B/2} )}$; the expression inside the trace is positive definite so the right-hand side is positive. (On the other hand, there is no reason why expressions such as ${\hbox{tr}(e^A e^B e^C)}$ need to be positive or even real, so the obvious extension of the Golden-Thompson inequality to three or more Hermitian matrices fails.) I am told that this inequality is quite useful in statistical mechanics, although I do not know the details of this.
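One can also test the inequality numerically. The following sketch (an illustrative addition, not part of the original discussion; the helper names are ours) draws random Hermitian pairs and checks that ${\hbox{tr}(e^A e^B) - \hbox{tr}(e^{A+B})}$ is always non-negative, computing the matrix exponential via the eigendecomposition that is available for Hermitian matrices:

```python
import numpy as np

def expm_h(H):
    """Matrix exponential of a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T  # V diag(e^w) V*

def golden_thompson_gap(n=5, trials=50, seed=0):
    """Minimum of tr(e^A e^B) - tr(e^{A+B}) over random Hermitian pairs;
    the Golden-Thompson inequality predicts this is non-negative."""
    rng = np.random.default_rng(seed)

    def herm():
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / 2

    gaps = []
    for _ in range(trials):
        A, B = herm(), herm()
        lhs = np.trace(expm_h(A + B)).real
        rhs = np.trace(expm_h(A) @ expm_h(B)).real  # = tr(e^{B/2} e^A e^{B/2}), real
        gaps.append(rhs - lhs)
    return min(gaps)
```

Note that the right-hand trace is taken with `.real` only to discard floating-point noise; as remarked above, it is genuinely real and positive.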

To get a sense of how delicate the Golden-Thompson inequality is, let us expand both sides to fourth order in ${A, B}$. The left-hand side expands as

$\displaystyle \hbox{tr} 1 + \hbox{tr} (A+B) + \frac{1}{2} \hbox{tr} (A^2 + AB + BA + B^2) + \frac{1}{6} \hbox{tr} (A+B)^3$

$\displaystyle + \frac{1}{24} \hbox{tr} (A+B)^4 + \ldots$

while the right-hand side expands as

$\displaystyle \hbox{tr} 1 + \hbox{tr} (A+B) + \frac{1}{2} \hbox{tr} (A^2 + 2AB + B^2)$

$\displaystyle + \frac{1}{6} \hbox{tr} (A^3 + 3A^2 B + 3 A B^2+B^3) +$

$\displaystyle \frac{1}{24} \hbox{tr} (A^4 + 4 A^3 B + 6 A^2 B^2 + 4 A B^3 +B^4) + \ldots$

Using the cyclic property of trace ${\hbox{tr}(AB) = \hbox{tr}(BA)}$, one can verify that all terms up to third order agree. Turning to the fourth order terms, after expanding out ${(A+B)^4}$ and using the cyclic property of trace as much as possible, one sees that the fourth order terms almost agree, but the left-hand side contains a term ${\frac{1}{12} \hbox{tr}(ABAB)}$ whose counterpart on the right-hand side is ${\frac{1}{12} \hbox{tr}(ABBA)}$. The difference between the right-hand side and the left-hand side can be factorised (again using the cyclic property of trace) as ${-\frac{1}{24} \hbox{tr} [A,B]^2}$. Since ${[A,B] := AB-BA}$ is skew-Hermitian, ${-[A,B]^2}$ is positive semi-definite, so this difference is non-negative, and we have proven the Golden-Thompson inequality to fourth order. (One could also have used the Cauchy-Schwarz inequality for the Frobenius norm to establish this; see below.)
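The trace manipulation above boils down to the identity ${\hbox{tr}(ABAB) - \hbox{tr}(A^2B^2) = \frac{1}{2}\hbox{tr}[A,B]^2}$, which the following sketch (an illustrative addition; helper names ours) verifies on a random Hermitian pair, along with the non-negativity of ${-\hbox{tr}[A,B]^2}$:

```python
import numpy as np

def check_fourth_order(n=4, seed=1):
    """For a random pair of Hermitian matrices, check the trace identity
    tr(ABAB) - tr(AABB) = (1/2) tr([A,B]^2) used in the fourth-order
    comparison, and return -tr([A,B]^2), which should be non-negative
    since -[A,B]^2 is positive semi-definite."""
    rng = np.random.default_rng(seed)

    def herm():
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / 2

    A, B = herm(), herm()
    C = A @ B - B @ A  # the commutator [A,B], which is skew-Hermitian
    identity_error = abs(np.trace(A @ B @ A @ B) - np.trace(A @ A @ B @ B)
                         - 0.5 * np.trace(C @ C))
    return identity_error, (-np.trace(C @ C)).real
```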

Intuitively, the Golden-Thompson inequality is asserting that interactions between a pair ${A, B}$ of non-commuting Hermitian matrices are strongest when cross-interactions are kept to a minimum, so that all the ${A}$ factors lie on one side of a product and all the ${B}$ factors lie on the other. Indeed, this theme will be running through the proof of this inequality, to which we now turn.

In this final set of lecture notes for this course, we leave the realm of self-adjoint matrix ensembles, such as Wigner random matrices, and consider instead the simplest examples of non-self-adjoint ensembles, namely the iid matrix ensembles. (I had also hoped to discuss recent progress in eigenvalue spacing distributions of Wigner matrices, but have run out of time. For readers interested in this topic, I can recommend the recent Bourbaki exposé of Alice Guionnet.)

The basic result in this area is

Theorem 1 (Circular law) Let ${M_n}$ be an ${n \times n}$ iid matrix, whose entries ${\xi_{ij}}$, ${1 \leq i,j \leq n}$ are iid with a fixed (complex) distribution ${\xi_{ij} \equiv \xi}$ of mean zero and variance one. Then the spectral measure ${\mu_{\frac{1}{\sqrt{n}}M_n}}$ converges both in probability and almost surely to the circular law ${\mu_{circ} := \frac{1}{\pi} 1_{|x|^2+|y|^2 \leq 1}\ dx dy}$, where ${x, y}$ are the real and imaginary coordinates of the complex plane.

This theorem has a long history; it is analogous to the semi-circular law, but the non-Hermitian nature of the matrices makes the spectrum so unstable that key techniques that are used in the semi-circular case, such as truncation and the moment method, no longer work; significant new ideas are required. In the case of random gaussian matrices, this result was established by Mehta (in the complex case) and by Edelman (in the real case), as was sketched out in Notes. In 1984, Girko laid out a general strategy for establishing the result for non-gaussian matrices, which formed the basis of all future work on the subject; however, a key ingredient in the argument, namely a bound on the least singular value of shifts ${\frac{1}{\sqrt{n}} M_n - zI}$, was not fully justified at the time. A rigorous proof of the circular law was then established by Bai, assuming additional moment and boundedness conditions on the individual entries. These additional conditions were then slowly removed in a sequence of papers by Götze-Tikhomirov, Girko, Pan-Zhou, and Tao-Vu, with the last moment condition being removed in a paper of myself, Van Vu, and Manjunath Krishnapur.

At present, the known methods used to establish the circular law for general ensembles rely very heavily on the joint independence of all the entries. It is a key challenge to see how to weaken this joint independence assumption.
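The circular law is already visible at quite modest ${n}$. The following sketch (an illustrative addition, not part of the original text; for convenience it uses real gaussian entries, a special case of the theorem) computes the empirical fraction of eigenvalues of ${\frac{1}{\sqrt{n}} M_n}$ inside a disk, which the circular law predicts should be approximately the disk's normalised area ${\min(r^2,1)}$:

```python
import numpy as np

def circular_law_fraction(n=500, radius=1.0, seed=2):
    """Fraction of eigenvalues of (1/sqrt(n)) M_n lying in the disk of the
    given radius, for an n x n matrix M_n with iid standard gaussian
    entries (mean 0, variance 1); the circular law predicts this tends
    to min(radius^2, 1) as n -> infinity."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    eigs = np.linalg.eigvals(M / np.sqrt(n))
    return float(np.mean(np.abs(eigs) <= radius))
```

For instance, the fraction inside radius ${1/\sqrt{2}}$ should be close to ${1/2}$, while almost all eigenvalues should lie in the unit disk.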

In the foundations of modern probability, as laid out by Kolmogorov, the basic objects of study are constructed in the following order:

1. Firstly, one selects a sample space ${\Omega}$, whose elements ${\omega}$ represent all the possible states that one’s stochastic system could be in.
2. Then, one selects a ${\sigma}$-algebra ${{\mathcal B}}$ of events ${E}$ (modeled by subsets of ${\Omega}$), and assigns each of these events a probability ${{\bf P}(E) \in [0,1]}$ in a countably additive manner, so that the entire sample space has probability ${1}$.
3. Finally, one builds (commutative) algebras of random variables ${X}$ (such as complex-valued random variables, modeled by measurable functions from ${\Omega}$ to ${{\bf C}}$), and (assuming suitable integrability or moment conditions) one can assign expectations ${\mathop{\bf E} X}$ to each such random variable.

In measure theory, the underlying measure space ${\Omega}$ plays a prominent foundational role, with the measurable sets and measurable functions (the analogues of the events and the random variables) always being viewed as somehow being attached to that space. In probability theory, in contrast, it is the events and their probabilities that are viewed as being fundamental, with the sample space ${\Omega}$ being abstracted away as much as possible, and with the random variables and expectations being viewed as derived concepts. See Notes 0 for further discussion of this philosophy.

However, it is possible to take the abstraction process one step further, and view the algebra of random variables and their expectations as being the foundational concept, ignoring the presence of the original sample space, the algebra of events, and the probability measure altogether.

There are two reasons for wanting to shed (or abstract away) these previously foundational structures. Firstly, it allows one to more easily take certain types of limits, such as the large ${n}$ limit ${n \rightarrow \infty}$ when considering ${n \times n}$ random matrices, because quantities built from the algebra of random variables and their expectations, such as the normalised moments of random matrices, tend to be quite stable in the large ${n}$ limit (as we have seen in previous notes), even as the sample space and event space vary with ${n}$. (This theme of using abstraction to facilitate the taking of the large ${n}$ limit also shows up in the application of ergodic theory to combinatorics via the correspondence principle; see this previous blog post for further discussion.)

Secondly, this abstract formalism allows one to generalise the classical, commutative theory of probability to the more general theory of non-commutative probability theory, which does not have a classical underlying sample space or event space, but is instead built upon a (possibly) non-commutative algebra of random variables (or “observables”) and their expectations (or “traces”). This more general formalism not only encompasses classical probability, but also spectral theory (with matrices or operators taking the role of random variables, and the trace taking the role of expectation), random matrix theory (which can be viewed as a natural blend of classical probability and spectral theory), and quantum mechanics (with physical observables taking the role of random variables, and their expected value on a given quantum state being the expectation). It is also part of a more general “non-commutative way of thinking” (of which non-commutative geometry is the most prominent example), in which a space is understood primarily in terms of the ring or algebra of functions (or function-like objects, such as sections of bundles) placed on top of that space, and then the space itself is largely abstracted away in order to allow the algebraic structures to become less commutative. In short, the idea is to make algebra the foundation of the theory, as opposed to other possible choices of foundations such as sets, measures, categories, etc..

[Note that this foundational preference is to some extent a metamathematical one rather than a mathematical one; in many cases it is possible to rewrite the theory in a mathematically equivalent form so that some other mathematical structure becomes designated as the foundational one, much as probability theory can be equivalently formulated as the measure theory of probability measures. However, this does not negate the fact that a different choice of foundations can lead to a different way of thinking about the subject, and thus to a different set of questions and a different set of proofs and solutions. Thus it is often of value to understand multiple foundational perspectives at once, to get a truly stereoscopic view of the subject.]

It turns out that non-commutative probability can be modeled using operator algebras such as ${C^*}$-algebras, von Neumann algebras, or algebras of bounded operators on a Hilbert space, with the latter being accomplished via the Gelfand-Naimark-Segal construction. We will discuss some of these models here, but just as probability theory seeks to abstract away its measure-theoretic models, the philosophy of non-commutative probability is also to downplay these operator algebraic models once some foundational issues are settled.

When one generalises the set of structures in one’s theory, for instance from the commutative setting to the non-commutative setting, the notion of what it means for a structure to be “universal”, “free”, or “independent” can change. The most familiar example of this comes from group theory. If one restricts attention to the category of abelian groups, then the “freest” object one can generate from two generators ${e,f}$ is the free abelian group of commutative words ${e^n f^m}$ with ${n,m \in {\bf Z}}$, which is isomorphic to the group ${{\bf Z}^2}$. If however one generalises to the non-commutative setting of arbitrary groups, then the “freest” object that can now be generated from two generators ${e,f}$ is the free group ${{\Bbb F}_2}$ of non-commutative words ${e^{n_1} f^{m_1} \ldots e^{n_k} f^{m_k}}$ with ${n_1,m_1,\ldots,n_k,m_k \in {\bf Z}}$, which is a significantly larger extension of the free abelian group ${{\bf Z}^2}$.

Similarly, when generalising classical probability theory to non-commutative probability theory, the notion of what it means for two or more random variables to be independent changes. In the classical (commutative) setting, two (bounded, real-valued) random variables ${X, Y}$ are independent if one has

$\displaystyle \mathop{\bf E} f(X) g(Y) = 0$

whenever ${f, g: {\bf R} \rightarrow {\bf R}}$ are well-behaved functions (such as polynomials) such that both ${\mathop{\bf E} f(X)}$ and ${\mathop{\bf E} g(Y)}$ vanish. In the non-commutative setting, one can generalise the above definition to two commuting bounded self-adjoint variables; this concept is useful for instance in quantum probability, which is an abstraction of the theory of observables in quantum mechanics. But for two (bounded, self-adjoint) non-commutative random variables ${X, Y}$, the notion of classical independence no longer applies. As a substitute, one can instead consider the notion of being freely independent (or free for short), which means that

$\displaystyle \mathop{\bf E} f_1(X) g_1(Y) \ldots f_k(X) g_k(Y) = 0$

whenever ${f_1,g_1,\ldots,f_k,g_k: {\bf R} \rightarrow {\bf R}}$ are well-behaved functions such that all of ${\mathop{\bf E} f_1(X), \mathop{\bf E} g_1(Y), \ldots, \mathop{\bf E} f_k(X), \mathop{\bf E} g_k(Y)}$ vanish.

The concept of free independence was introduced by Voiculescu, and its study is now known as the subject of free probability. We will not attempt a systematic survey of this subject here; for this, we refer the reader to the surveys of Speicher and of Biane. Instead, we shall just discuss a small number of topics in this area, in order to give the flavour of the subject.

The significance of free probability to random matrix theory lies in the fundamental observation that random matrices which are independent in the classical sense also tend to be independent in the free probability sense, in the large ${n}$ limit ${n \rightarrow \infty}$. (This is only possible because of the highly non-commutative nature of these matrices; as we shall see, it is not possible for non-trivial commuting independent random variables to be freely independent.) Because of this, many tedious computations in random matrix theory, particularly those of an algebraic or enumerative combinatorial nature, can be done more quickly and systematically by using the framework of free probability, which by design is optimised for algebraic tasks rather than analytical ones.

Much as free groups are in some sense “maximally non-commutative”, freely independent random variables are about as far from being commuting as possible. For instance, if ${X, Y}$ are freely independent and of expectation zero, then ${\mathop{\bf E} XYXY}$ vanishes, but ${\mathop{\bf E} XXYY}$ instead factors as ${(\mathop{\bf E} X^2) (\mathop{\bf E} Y^2)}$. As a consequence, the behaviour of freely independent random variables can be quite different from the behaviour of their classically independent commuting counterparts. Nevertheless there is a remarkably strong analogy between the two types of independence, in that results which are true in the classically independent case often have an interesting analogue in the freely independent setting. For instance, the central limit theorem (Notes 2) for averages of classically independent random variables, which roughly speaking asserts that such averages become gaussian in the large ${n}$ limit, has an analogue for averages of freely independent variables, the free central limit theorem, which roughly speaking asserts that such averages become semicircular in the large ${n}$ limit. One can then use this theorem to provide yet another proof of Wigner’s semicircle law (Notes 4).
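The contrast between ${\mathop{\bf E} XYXY}$ and ${\mathop{\bf E} XXYY}$ can be observed numerically through the asymptotic freeness of independent random matrices mentioned above. The following sketch (an illustrative addition; the helper names and the GUE-type normalisation are ours) computes the normalised traces for two independent Wigner-type matrices and compares them with the free predictions ${\tau(XYXY) \approx 0}$ and ${\tau(XXYY) \approx \tau(X^2)\tau(Y^2)}$:

```python
import numpy as np

def mixed_moments(n=500, seed=3):
    """Normalised traces tau(XYXY) and tau(XXYY) for two independent
    GUE-type matrices (entry variance 1/n, so tau(X^2) -> 1), which
    become freely independent as n -> infinity.  Free probability
    predicts tau(XYXY) -> 0 while tau(XXYY) -> tau(X^2) tau(Y^2)."""
    rng = np.random.default_rng(seed)

    def gue():
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / np.sqrt(4 * n)

    X, Y = gue(), gue()
    tau = lambda a: (np.trace(a) / n).real  # normalised trace
    return tau(X @ Y @ X @ Y), tau(X @ X @ Y @ Y), tau(X @ X) * tau(Y @ Y)
```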

Another important (and closely related) analogy is that while the distribution of sums of independent commutative random variables can be quickly computed via the characteristic function (i.e. the Fourier transform of the distribution), the distribution of sums of freely independent non-commutative random variables can be quickly computed using the Stieltjes transform instead (or with closely related objects, such as the ${R}$-transform of Voiculescu). This is strongly reminiscent of the appearance of the Stieltjes transform in random matrix theory, and indeed we will see many parallels between the use of the Stieltjes transform here and in Notes 4.

As mentioned earlier, free probability is an excellent tool for computing various expressions of interest in random matrix theory, such as asymptotic values of normalised moments in the large ${n}$ limit ${n \rightarrow \infty}$. Nevertheless, as it only covers the asymptotic regime in which ${n}$ is sent to infinity while holding all other parameters fixed, there are some aspects of random matrix theory to which the tools of free probability are not sufficient by themselves to resolve (although it can be possible to combine free probability theory with other tools to then answer these questions). For instance, questions regarding the rate of convergence of normalised moments as ${n \rightarrow \infty}$ are not directly answered by free probability, though if free probability is combined with tools such as concentration of measure (Notes 1) then such rate information can often be recovered. For similar reasons, free probability lets one understand the behaviour of ${k^{th}}$ moments as ${n \rightarrow \infty}$ for fixed ${k}$, but has more difficulty dealing with the situation in which ${k}$ is allowed to grow slowly in ${n}$ (e.g. ${k = O(\log n)}$). Because of this, free probability methods are effective at controlling the bulk of the spectrum of a random matrix, but have more difficulty with the edges of that spectrum (as well as with related concepts such as the operator norm, Notes 3) as well as with fine-scale structure of the spectrum. Finally, free probability methods are most effective when dealing with matrices that are Hermitian with bounded operator norm, largely because the spectral theory of bounded self-adjoint operators in the infinite-dimensional setting of the large ${n}$ limit is non-pathological. (This is ultimately due to the stable nature of eigenvalues in the self-adjoint setting; see this previous blog post for discussion.) 
For non-self-adjoint operators, free probability needs to be augmented with additional tools, most notably by bounds on least singular values, in order to recover the required stability for the various spectral data of random matrices to behave continuously with respect to the large ${n}$ limit. We will discuss this latter point in a later set of notes.

Tim Austin, Tanja Eisner, and I have just uploaded to the arXiv our joint paper Nonconventional ergodic averages and multiple recurrence for von Neumann dynamical systems, submitted to Pacific Journal of Mathematics. This project started with the observation that the multiple recurrence theorem of Furstenberg (and the related multiple convergence theorem of Host and Kra) could be interpreted in the language of dynamical systems of commutative finite von Neumann algebras, which naturally raised the question of the extent to which the results hold in the noncommutative setting. The short answer is “yes for small averages, but not for long ones”.

The Furstenberg multiple recurrence theorem can be phrased as follows: if ${X = (X, {\mathcal X}, \mu)}$ is a probability space with a measure-preserving shift ${T:X \rightarrow X}$ (which naturally induces an isomorphism ${\alpha: L^\infty(X) \rightarrow L^\infty(X)}$ by setting ${\alpha a := a \circ T^{-1}}$), ${a \in L^\infty(X)}$ is non-negative with positive trace ${\tau(a) := \int_X a\ d\mu}$, and ${k \geq 1}$ is an integer, then one has

$\displaystyle \liminf_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \tau( a (\alpha^n a) \ldots (\alpha^{(k-1)n} a) ) > 0.$

In particular, ${\tau( a (\alpha^n a) \ldots (\alpha^{(k-1)n} a) ) > 0}$ for all ${n}$ in a set of positive upper density. This result is famously equivalent to Szemerédi’s theorem on arithmetic progressions.
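Although the theorem itself lies deep, the averages in question are easy to compute in finite commutative toy systems. The following sketch (an illustrative addition; the helper name is ours) evaluates the ${k=3}$ average for the cyclic shift on ${{\bf Z}/N{\bf Z}}$ with ${a = 1_A}$, where positivity reflects the presence of three-term progressions in ${A}$:

```python
import numpy as np

def triple_recurrence_average(A, N):
    """Toy commutative example: on X = Z/N with shift T(x) = x + 1 and
    a = 1_A, compute the k = 3 Furstenberg average
    (1/N) sum_{n=1}^N tau(a (alpha^n a) (alpha^{2n} a)),
    where (alpha^n a)(x) = a(x - n) and tau averages over X."""
    a = np.zeros(N)
    a[list(A)] = 1.0
    total = 0.0
    for n in range(1, N + 1):
        # np.roll(a, n)[x] = a[x - n], i.e. alpha^n a
        total += np.mean(a * np.roll(a, n) * np.roll(a, 2 * n))
    return total / N
```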

The Host-Kra multiple convergence theorem makes the related assertion that if ${a_0,\ldots,a_{k-1} \in L^\infty(X)}$, then the scalar averages

$\displaystyle \frac{1}{N} \sum_{n=1}^N \tau( a_0 (\alpha^n a_1) \ldots (\alpha^{(k-1)n} a_{k-1}) )$

converge to a limit as ${N \rightarrow \infty}$; a fortiori, the function averages

$\displaystyle \frac{1}{N} \sum_{n=1}^N (\alpha^n a_1) \ldots (\alpha^{(k-1)n} a_{k-1})$

converge in (say) ${L^2(X)}$ norm.

The space ${L^\infty(X)}$ is a commutative example of a von Neumann algebra: an algebra of bounded linear operators on a complex Hilbert space ${H}$ which is closed under the weak operator topology, and under taking adjoints. Indeed, one can take ${H}$ to be ${L^2(X)}$, and identify each element ${m}$ of ${L^\infty(X)}$ with the multiplier operator ${a \mapsto ma}$. The operation ${\tau: a \mapsto \int_X a\ d\mu}$ is then a finite trace for this algebra, i.e. a linear map from the algebra to the scalars ${{\mathbb C}}$ such that ${\tau(ab)=\tau(ba)}$, ${\tau(a^*) = \overline{\tau(a)}}$, and ${\tau(a^* a) \geq 0}$, with equality iff ${a=0}$. The shift ${\alpha: L^\infty(X) \rightarrow L^\infty(X)}$ is then an automorphism of this algebra (preserving the trace and conjugation).

We can generalise this situation to the noncommutative setting. Define a von Neumann dynamical system ${(M, \tau, \alpha)}$ to be a von Neumann algebra ${M}$ with a finite trace ${\tau}$ and an automorphism ${\alpha: M \rightarrow M}$. In addition to the commutative examples generated by measure-preserving systems, we give three other examples here:

• (Matrices) ${M = M_n({\mathbb C})}$ is the algebra of ${n \times n}$ complex matrices, with trace ${\tau(a) = \frac{1}{n} \hbox{tr}(a)}$ and shift ${\alpha(a) := UaU^{-1}}$, where ${U}$ is a fixed unitary ${n \times n}$ matrix.
• (Group algebras) ${M = \overline{{\mathbb C} G}}$ is the closure of the group algebra ${{\mathbb C} G}$ of a discrete group ${G}$ (i.e. the algebra of finite formal complex combinations of group elements), which acts on the Hilbert space ${\ell^2(G)}$ by convolution (identifying each group element with its Kronecker delta function). A trace is given by ${\tau(a) = \langle a \delta_0, \delta_0 \rangle_{\ell^2(G)}}$, where ${\delta_0 \in \ell^2(G)}$ is the Kronecker delta at the identity. Any automorphism ${T: G \rightarrow G}$ of the group induces a shift ${\alpha: M \rightarrow M}$.
• (Noncommutative torus) ${M}$ is the von Neumann algebra acting on ${L^2(({\mathbb R}/{\mathbb Z})^2)}$ generated by the multiplier operator ${f(x,y) \mapsto e^{2\pi i x} f(x,y)}$ and the shifted multiplier operator ${f(x,y) \mapsto e^{2\pi i y} f(x+\alpha,y)}$, where ${\alpha \in {\mathbb R}/{\mathbb Z}}$ is fixed. A trace is given by ${\tau(a) = \langle 1, a1\rangle_{L^2(({\mathbb R}/{\mathbb Z})^2)}}$, where ${1 \in L^2(({\mathbb R}/{\mathbb Z})^2)}$ is the constant function.
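The first (matrix) example can be played with directly. The following sketch (an illustrative addition; the helper names are ours) checks the trace axioms ${\tau(ab)=\tau(ba)}$ and ${\tau(a^*a) \geq 0}$ for ${\tau(a) = \frac{1}{n}\hbox{tr}(a)}$, and that the shift ${\alpha(a) = UaU^{-1}}$ preserves this trace:

```python
import numpy as np

def matrix_system_demo(n=4, seed=5):
    """The matrix von Neumann dynamical system M = M_n(C) with
    tau(a) = tr(a)/n and alpha(a) = U a U^{-1} for a fixed unitary U.
    Returns |tau(ab) - tau(ba)|, tau(a* a), and |tau(alpha(a)) - tau(a)|."""
    rng = np.random.default_rng(seed)

    def rand_mat():
        return rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

    a, b = rand_mat(), rand_mat()
    # a unitary matrix from the QR decomposition of a random complex matrix
    U, _ = np.linalg.qr(rand_mat())
    tau = lambda x: np.trace(x) / n
    alpha = lambda x: U @ x @ U.conj().T  # U^{-1} = U* since U is unitary
    return (abs(tau(a @ b) - tau(b @ a)),
            tau(a.conj().T @ a).real,
            abs(tau(alpha(a)) - tau(a)))
```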

Inspired by noncommutative generalisations of other results in commutative analysis, one can then ask the following questions, for a fixed ${k \geq 1}$ and for a fixed von Neumann dynamical system ${(M,\tau,\alpha)}$:

• (Recurrence on average) Whenever ${a \in M}$ is non-negative with positive trace, is it true that

$\displaystyle \liminf_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \tau( a (\alpha^n a) \ldots (\alpha^{(k-1)n} a) ) > 0?$

• (Recurrence on a dense set) Whenever ${a \in M}$ is non-negative with positive trace, is it true that

$\displaystyle \tau( a (\alpha^n a) \ldots (\alpha^{(k-1)n} a) ) > 0$

for all ${n}$ in a set of positive upper density?
• (Weak convergence) With ${a_0,\ldots,a_{k-1} \in M}$, do the averages

$\displaystyle \frac{1}{N} \sum_{n=1}^N \tau( a_0 (\alpha^n a_1) \ldots (\alpha^{(k-1)n} a_{k-1}) )$

converge?
• (Strong convergence) With ${a_1,\ldots,a_{k-1} \in M}$, do the averages

$\displaystyle \frac{1}{N} \sum_{n=1}^N (\alpha^n a_1) \ldots (\alpha^{(k-1)n} a_{k-1})$

converge in the Hilbert-Schmidt norm ${\|a\|_{L^2(M)} := \tau(a^* a)^{1/2}}$?

Note that strong convergence automatically implies weak convergence, and recurrence on average automatically implies recurrence on a dense set.

For ${k=1}$, all four questions can trivially be answered “yes”. For ${k=2}$, the answer to the above four questions is also “yes”, thanks to the von Neumann ergodic theorem for unitary operators. For ${k=3}$, we were able to establish a positive answer to the “recurrence on a dense set”, “weak convergence”, and “strong convergence” questions assuming that ${M}$ is ergodic. For general ${k}$, we have a positive answer to all four questions under the assumption that ${M}$ is asymptotically abelian, which roughly speaking means that the commutators ${[a,\alpha^n b]}$ converge to zero (in an appropriate weak sense) as ${n \rightarrow \infty}$. Both of these proofs adapt the usual ergodic theory arguments; the latter result generalises some earlier work of Niculescu-Stroh-Zsido, Duvenhage, and Beyers-Duvenhage-Stroh. For the ${k=3}$ result, a key observation is that the van der Corput lemma can be used to control triple averages without requiring any commutativity; the “generalised von Neumann” trick of using multiple applications of the van der Corput lemma to control higher averages, however, relies much more strongly on commutativity.

In most other situations we have counterexamples to all of these questions. In particular:

• For ${k=3}$, recurrence on average can fail on an ergodic system; indeed, one can even make the average negative. This example is ultimately based on a Behrend example construction and a von Neumann algebra construction known as the crossed product.
• For ${k=3}$, recurrence on a dense set can also fail if the ergodicity hypothesis is dropped. This also uses the Behrend example and the crossed product construction.
• For ${k=4}$, weak and strong convergence can fail even assuming ergodicity. This uses a group theoretic construction, which amusingly was inspired by Grothendieck’s interpretation of a group as a sheaf of flat connections, which I blogged about recently, and which I will discuss below the fold.
• For ${k=5}$, recurrence on a dense set fails even with the ergodicity hypothesis. This uses a fancier version of the Behrend example, due to Ruzsa, that appears in this paper of Bergelson, Host, and Kra. This example only applies for ${k \geq 5}$; we do not know for ${k=4}$ whether recurrence on a dense set holds for ergodic systems.

In these notes we lay out the basic theory of the Fourier transform, which is of course the most fundamental tool in harmonic analysis and also of major importance in related fields (functional analysis, complex analysis, PDE, number theory, additive combinatorics, representation theory, signal processing, etc.). The Fourier transform, in conjunction with the Fourier inversion formula, allows one to take essentially arbitrary (complex-valued) functions on a group ${G}$ (or more generally, a space ${X}$ that ${G}$ acts on, e.g. a homogeneous space ${G/H}$), and decompose them as a (discrete or continuous) superposition of much more symmetric functions on the domain, such as characters ${\chi: G \rightarrow S^1}$; the precise superposition is given by Fourier coefficients ${\hat f(\xi)}$, which take values in some dual object such as the Pontryagin dual ${\hat G}$ of ${G}$. Characters behave in a very simple manner with respect to translation (indeed, they are eigenfunctions of the translation action), and so the Fourier transform tends to simplify any mathematical problem which enjoys a translation invariance symmetry (or an approximation to such a symmetry), and is somehow “linear” (i.e. it interacts nicely with superpositions). In particular, Fourier analytic methods are particularly useful for studying operations such as convolution ${f, g \mapsto f*g}$ and set-theoretic addition ${A, B \mapsto A+B}$, or the closely related problem of counting solutions to additive problems such as ${x = a_1 + a_2 + a_3}$ or ${x = a_1 - a_2}$, where ${a_1, a_2, a_3}$ are constrained to lie in specific sets ${A_1, A_2, A_3}$. The Fourier transform is also a particularly powerful tool for solving constant-coefficient linear ODE and PDE (because of the translation invariance), and can also approximately solve some variable-coefficient (or slightly non-linear) equations if the coefficients vary smoothly enough and the nonlinear terms are sufficiently tame.

The Fourier transform ${\hat f(\xi)}$ also provides an important new way of looking at a function ${f(x)}$, as it highlights the distribution of ${f}$ in frequency space (the domain of the frequency variable ${\xi}$) rather than physical space (the domain of the physical variable ${x}$). A given property of ${f}$ in the physical domain may be transformed to a rather different-looking property of ${\hat f}$ in the frequency domain. For instance:

• Smoothness of ${f}$ in the physical domain corresponds to decay of ${\hat f}$ in the Fourier domain, and conversely. (More generally, fine scale properties of ${f}$ tend to manifest themselves as coarse scale properties of ${\hat f}$, and conversely.)
• Convolution in the physical domain corresponds to pointwise multiplication in the Fourier domain, and conversely.
• Constant-coefficient differential operators such as ${d/dx}$ in the physical domain correspond to multiplication by polynomials such as ${2\pi i \xi}$ in the Fourier domain, and conversely.
• More generally, translation invariant operators in the physical domain correspond to multiplication by symbols in the Fourier domain, and conversely.
• Rescaling in the physical domain by an invertible linear transformation corresponds to an inverse (adjoint) rescaling in the Fourier domain.
• Restriction to a subspace (or subgroup) in the physical domain corresponds to projection to the dual quotient space (or quotient group) in the Fourier domain, and conversely.
• Frequency modulation in the physical domain corresponds to translation in the frequency domain, and conversely.

(We will make these statements more precise below.)

On the other hand, some operations in the physical domain remain essentially unchanged in the Fourier domain. Most importantly, the ${L^2}$ norm (or energy) of a function ${f}$ is the same as that of its Fourier transform, and more generally the inner product ${\langle f, g \rangle}$ of two functions ${f, g}$ is the same as that of their Fourier transforms. Indeed, the Fourier transform is a unitary operator on ${L^2}$ (a fact which is variously known as the Plancherel theorem or the Parseval identity). This makes it easier to pass back and forth between the physical domain and frequency domain, so that one can combine techniques that are easy to execute in the physical domain with other techniques that are easy to execute in the frequency domain. (In fact, one can combine the physical and frequency domains together into a product domain known as phase space, and there are entire fields of mathematics (e.g. microlocal analysis, geometric quantisation, time-frequency analysis) devoted to performing analysis on these sorts of spaces directly, but this is beyond the scope of this course.)
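Two of these facts, the convolution theorem and the Parseval identity, have exact finite analogues on the cyclic group ${{\bf Z}/N{\bf Z}}$ which can be checked with the discrete Fourier transform. The following sketch (an illustrative addition; the helper name is ours, and numpy's unnormalised DFT convention introduces the factor of ${N}$ in the second identity) verifies both:

```python
import numpy as np

def dft_identities(N=64, seed=7):
    """On Z/N, check (i) that the DFT turns cyclic convolution into
    pointwise multiplication, and (ii) the Parseval identity
    <f, g> = (1/N) <fhat, ghat> for numpy's unnormalised DFT."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(N)
    g = rng.standard_normal(N)
    # cyclic convolution computed directly from the definition
    conv = np.array([sum(f[j] * g[(k - j) % N] for j in range(N))
                     for k in range(N)])
    err_conv = np.max(np.abs(np.fft.fft(conv) - np.fft.fft(f) * np.fft.fft(g)))
    err_pars = abs(np.vdot(f, g) - np.vdot(np.fft.fft(f), np.fft.fft(g)) / N)
    return err_conv, err_pars
```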

In these notes, we briefly discuss the general theory of the Fourier transform, but will mainly focus on the two classical domains for Fourier analysis: the torus ${{\Bbb T}^d := ({\bf R}/{\bf Z})^d}$, and the Euclidean space ${{\bf R}^d}$. For these domains one has the advantage of being able to perform very explicit algebraic calculations, involving concrete functions such as plane waves ${x \mapsto e^{2\pi i x \cdot \xi}}$ or Gaussians ${x \mapsto A^{d/2} e^{-\pi A |x|^2}}$.

One way to study a general class of mathematical objects is to embed them into a more structured class of mathematical objects; for instance, one could study manifolds by embedding them into Euclidean spaces. In these (optional) notes we study two (related) embedding theorems for topological spaces: