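Consider the sum $S_n := X_1 + \ldots + X_n$ of iid real random variables $X_1,\ldots,X_n \equiv X$ of finite mean $\mu$ and variance $\sigma^2$ for some $\sigma > 0$. Then the sum $S_n$ has mean $n\mu$ and variance $n\sigma^2$, and so (by Chebyshev’s inequality) we expect $S_n$ to usually have size $n\mu + O(\sqrt{n} \sigma)$. To put it another way, if we consider the normalised sum
$$ Z_n := \frac{S_n - n\mu}{\sqrt{n} \sigma} \qquad (1) $$
then $Z_n$ has been normalised to have mean zero and variance $1$, and is thus usually of size $O(1)$.
In the previous set of notes, we were able to establish various tail bounds on $Z_n$. For instance, from Chebyshev’s inequality one has
$$ {\bf P}( |Z_n| > \lambda ) \leq \lambda^{-2} \qquad (2) $$
and if the original distribution $X$ was bounded or subgaussian, we had the much stronger Chernoff bound
$$ {\bf P}( |Z_n| > \lambda ) \leq C \exp( - c \lambda^2 ) \qquad (3) $$
for some absolute constants $C, c > 0$; in other words, the $Z_n$ are uniformly subgaussian.
Now we look at the distribution of $Z_n$. The fundamental central limit theorem tells us the asymptotic behaviour of this distribution:
Theorem 1 (Central limit theorem) Let $X_1,\ldots,X_n \equiv X$ be iid real random variables of finite mean $\mu$ and variance $\sigma^2$ for some $\sigma > 0$, and let $Z_n$ be the normalised sum (1). Then as $n \to \infty$, $Z_n$ converges in distribution to the standard normal distribution $N(0,1)_{\bf R}$.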
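As a numerical illustration of Theorem 1 (not a proof), the following minimal sketch samples the normalised sum for one particular choice of distribution, namely $X$ uniform on $[0,1]$, and measures how far the empirical distribution function is from the standard normal one. The choice of distribution, the sample sizes, the seed and all function names are assumptions made for this example only.

```python
import math
import random
from typing import List


def normal_cdf(x: float) -> float:
    """Distribution function of the standard normal N(0,1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normalised_sum(n: int, rng: random.Random) -> float:
    """One sample of Z_n = (S_n - n*mu) / (sqrt(n)*sigma) for X uniform on [0,1]."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (math.sqrt(n) * sigma)


def kolmogorov_distance(samples: List[float]) -> float:
    """sup_x |empirical CDF(x) - normal CDF(x)|, checked at the jump points."""
    xs = sorted(samples)
    m = len(xs)
    return max(
        max(abs((i + 1) / m - normal_cdf(x)), abs(i / m - normal_cdf(x)))
        for i, x in enumerate(xs)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [normalised_sum(50, rng) for _ in range(4000)]
clt_distance = kolmogorov_distance(samples)
print(f"Kolmogorov distance to N(0,1) for n=50: {clt_distance:.4f}")
```

The reported distance mixes the genuine gap between $Z_n$ and the gaussian with Monte Carlo sampling error, so it should only be read as a sanity check.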
Exercise 1 Show that $Z_n$ does not converge in probability or in the almost sure sense (in the latter case, we think of $X_1, X_2, \ldots$ as an infinite sequence of iid random variables). (Hint: the intuition here is that for two very different values $n_1 \ll n_2$ of $n$, the quantities $Z_{n_1}$ and $Z_{n_2}$ are almost independent of each other, since the bulk of the sum $S_{n_2}$ is determined by those $X_n$ with $n > n_1$. Now make this intuition precise.)
Exercise 2 Use Stirling’s formula from Notes 0a to verify the central limit theorem in the case when $X$ is a Bernoulli distribution, taking the values $0$ and $1$ only. (This is a variant of Exercise 2 from those notes, or Exercise 2 from Notes 1. It is easy to see that once one does this, one can rescale and handle any other two-valued distribution also.)
Exercise 3 Use Exercise 9 from Notes 1 to verify the central limit theorem in the case when $X$ is gaussian.
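The Bernoulli case of Exercise 2 can be sanity-checked numerically before attempting the Stirling computation. The sketch below assumes the fair case (values $0$ and $1$ with probability $1/2$ each) and compares the exact binomial probabilities of $S_n$ with the gaussian density of matching mean $n/2$ and variance $n/4$; the function names are illustrative only.

```python
import math


def binomial_pmf(n: int, k: int) -> float:
    """P(S_n = k) when S_n is a sum of n fair Bernoulli(1/2) variables."""
    return math.comb(n, k) / 2.0 ** n


def gaussian_approx(n: int, k: int) -> float:
    """Normal density with mean n/2 and variance n/4, evaluated at k."""
    var = n / 4.0
    return math.exp(-((k - n / 2.0) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def max_pointwise_error(n: int) -> float:
    """Largest gap between the two, rescaled by sqrt(n) to the scale of Z_n."""
    return math.sqrt(n) * max(
        abs(binomial_pmf(n, k) - gaussian_approx(n, k)) for k in range(n + 1)
    )


for n in (10, 100, 1000):
    print(n, max_pointwise_error(n))
```

The rescaled error shrinks as $n$ grows, which is the pointwise (local) form of the convergence that the exercise asks one to establish rigorously.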
Note we are only discussing the case of real iid random variables. The case of complex random variables (or more generally, vector-valued random variables) is a little bit more complicated, and will be discussed later in this post.
The central limit theorem and its variants, which we discuss below, are extremely useful tools in random matrix theory, in particular through the control they give on random walks (which arise naturally from linear functionals of random matrices). But the central limit theorem can also be viewed as a “commutative” analogue of various spectral results in random matrix theory (in particular, we shall see in later lectures that the Wigner semicircle law can be viewed in some sense as a “noncommutative” or “free” version of the central limit theorem). As a consequence, the techniques used to prove the central limit theorem can often be adapted to be useful in random matrix theory. For this reason, we shall use these notes to dwell on several different proofs of the central limit theorem, as this provides a convenient way to showcase some of the basic methods that we will encounter again (in a more sophisticated form) when dealing with random matrices.
— 1. Reductions —
We first record some simple reductions one can make regarding the proof of the central limit theorem. Firstly, we observe scale invariance: if the central limit theorem holds for one random variable $X$, then it is easy to see that it also holds for $aX+b$ for any real $a, b$ with $a \neq 0$. Because of this, one can normalise to the case when $X$ has mean $\mu = 0$ and variance $\sigma^2 = 1$, in which case $Z_n$ simplifies to
$$ Z_n = \frac{X_1 + \ldots + X_n}{\sqrt{n}}. \qquad (4) $$
The other reduction we can make is truncation: to prove the central limit theorem for arbitrary random variables of finite mean and variance, it suffices to verify the theorem for bounded random variables. To see this, we first need a basic linearity principle:
Exercise 4 (Linearity of convergence) Let $V$ be a finite-dimensional real or complex vector space, $X_n, Y_n$ be sequences of $V$-valued random variables (not necessarily independent), and let $X, Y$ be another pair of $V$-valued random variables. Let $c_n, d_n$ be scalars converging to $c, d$ respectively.
- If $X_n$ converges in distribution to $X$, and $Y_n$ converges in distribution to $Y$, and at least one of $X, Y$ is deterministic, show that $c_n X_n + d_n Y_n$ converges in distribution to $cX + dY$.
- If $X_n$ converges in probability to $X$, and $Y_n$ converges in probability to $Y$, show that $c_n X_n + d_n Y_n$ converges in probability to $cX + dY$.
- If $X_n$ converges almost surely to $X$, and $Y_n$ converges almost surely to $Y$, show that $c_n X_n + d_n Y_n$ converges almost surely to $cX + dY$.
Show that the first part of the exercise can fail if $X, Y$ are not deterministic.
Now suppose that we have established the central limit theorem for bounded random variables, and want to extend to the unbounded case. Let $X$ be an unbounded random variable, which we can normalise to have mean zero and unit variance. Let $N = N_n > 0$ be a truncation parameter depending on $n$ which, as usual, we shall optimise later, and split $X = X_{\leq N} + X_{>N}$ in the usual fashion ($X_{\leq N} := X {\bf I}(|X| \leq N)$; $X_{>N} := X {\bf I}(|X| > N)$). Thus we have $S_n = S_{n,\leq N} + S_{n,>N}$ as usual.
Let $\mu_{\leq N}, \sigma_{\leq N}^2$ be the mean and variance of the bounded random variable $X_{\leq N}$. As we are assuming that the central limit theorem is already true in the bounded case, we know that if we fix $N$ to be independent of $n$, then
$$ Z_{n,\leq N} := \frac{S_{n,\leq N} - n \mu_{\leq N}}{\sqrt{n} \sigma_{\leq N}} $$
converges in distribution to $N(0,1)_{\bf R}$. By a diagonalisation argument, we conclude that there exists a sequence $N_n$ going (slowly) to infinity with $n$, such that $Z_{n,\leq N_n}$ still converges in distribution to $N(0,1)_{\bf R}$.
For such a sequence, we see from dominated convergence that $\sigma_{\leq N_n}$ converges to $\sigma = 1$. As a consequence of this and Exercise 4, we see that
$$ \frac{S_{n,\leq N_n} - n \mu_{\leq N_n}}{\sqrt{n}} $$
converges in distribution to $N(0,1)_{\bf R}$.
Meanwhile, from dominated convergence again, $\sigma_{>N_n}$ converges to $0$. From this and (2) we see that
$$ \frac{S_{n,>N_n} - n \mu_{>N_n}}{\sqrt{n}} $$
converges in distribution to $0$. Finally, from linearity of expectation we have $\mu_{\leq N_n} + \mu_{>N_n} = \mu = 0$. Summing (using Exercise 4), we obtain the claim.
Remark 1 The truncation reduction is not needed for some proofs of the central limit theorem (notably the Fourier-analytic proof), but is very convenient for some of the other proofs that we will give here, and will also be used at several places in later notes.
By applying the scaling reduction after the truncation reduction, we observe that to prove the central limit theorem, it suffices to do so for random variables which are bounded and which have mean zero and unit variance. (Why is it important to perform the reductions in this order?)
— 2. The Fourier method —
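Given any real random variable $X$, we introduce the characteristic function $F_X: {\bf R} \to {\bf C}$, defined by the formula
$$ F_X(t) := {\bf E} e^{itX}. \qquad (5) $$
Example 1 The signed Bernoulli distribution has characteristic function $F(t) = \cos(t)$.
Exercise 5 Show that the normal distribution $N(\mu,\sigma^2)_{\bf R}$ has characteristic function $F(t) = e^{it\mu} e^{-\sigma^2 t^2/2}$.
More generally, for a random variable $X$ taking values in a real vector space ${\bf R}^d$, we define the characteristic function $F_X: {\bf R}^d \to {\bf C}$ by
$$ F_X(t) := {\bf E} e^{it \cdot X} \qquad (6) $$
where $\cdot$ denotes the Euclidean inner product on ${\bf R}^d$. One can similarly define the characteristic function on complex vector spaces ${\bf C}^d$ by using the complex inner product
$$ (z_1,\ldots,z_d) \cdot (w_1,\ldots,w_d) := \hbox{Re}( z_1 \overline{w_1} + \ldots + z_d \overline{w_d} ) $$
(or equivalently, by identifying ${\bf C}^d$ with ${\bf R}^{2d}$ in the usual manner.)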
More generally, one can define the characteristic function on any finite dimensional real or complex vector space $V$, by identifying $V$ with ${\bf R}^d$ or ${\bf C}^d$. (Strictly speaking, one either has to select an inner product on $V$ to do this, or else make the characteristic function defined on the dual space $V^*$ instead of on $V$ itself; see for instance my notes on the Fourier transform in general locally compact abelian groups. But we will not need to care about this subtlety in our applications.)
The characteristic function is clearly bounded in magnitude by $1$, and equals $1$ at the origin. By the Lebesgue dominated convergence theorem, $F_X$ is continuous in $t$.
Exercise 6 (Riemann-Lebesgue lemma) Show that if $X$ is an absolutely continuous random variable taking values in ${\bf R}^d$ or ${\bf C}^d$, then $F_X(t) \to 0$ as $t \to \infty$. Show that the claim can fail when the absolute continuity hypothesis is dropped.
Exercise 7 Show that the characteristic function $F_X$ of a random variable $X$ taking values in ${\bf R}^d$ or ${\bf C}^d$ is in fact uniformly continuous on its domain.
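Exercise 8 Let $X$ be a real random variable. If $X$ has finite $k^{th}$ moment for some $k \geq 1$, show that $F_X$ is $k$ times continuously differentiable, with
$$ \frac{d^j}{dt^j} F_X(t) = i^j {\bf E} X^j e^{itX} $$
for all $0 \leq j \leq k$. In particular, establish the partial Taylor expansion
$$ F_X(t) = \sum_{j=0}^k \frac{i^j {\bf E} X^j}{j!} t^j + o( |t|^k ) $$
where $o(|t|^k)$ is a quantity that goes to zero as $t \to 0$, times $|t|^k$.
Exercise 9 Establish (7) in the case that $X$ is subgaussian, and show that the series converges locally uniformly in $t$.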
Note that the characteristic function depends only on the distribution of $X$: if $X \equiv Y$, then $F_X = F_Y$. The converse statement is true also: if $F_X = F_Y$, then $X \equiv Y$. This follows from a more general (and useful) fact, known as Lévy’s continuity theorem.
Theorem 2 (Lévy continuity theorem, special case) Let $V$ be a finite-dimensional real or complex vector space, and let $X_n$ be a sequence of $V$-valued random variables, and let $X$ be an additional $V$-valued random variable. Then the following statements are equivalent:
- (i) $F_{X_n}$ converges pointwise to $F_X$.
- (ii) $X_n$ converges in distribution to $X$.
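Proof: Without loss of generality we may take $V = {\bf R}^d$. The implication of (i) from (ii) is immediate from the definition of convergence in distribution, since for each fixed $t$ the function $x \mapsto e^{it \cdot x}$ is bounded and continuous.
Now suppose that (i) holds, and we wish to show that (ii) holds. By Exercise 23(iv) of Notes 0, it suffices to show that
$$ {\bf E} \varphi(X_n) \to {\bf E} \varphi(X) $$
whenever $\varphi: V \to {\bf R}$ is a continuous, compactly supported function. By approximating $\varphi$ uniformly by Schwartz functions (e.g. using the Stone-Weierstrass theorem), it suffices to show this for Schwartz functions $\varphi$. But then we have the Fourier inversion formula
$$ \varphi(X_n) = \int_{{\bf R}^d} \hat \varphi(t) e^{it \cdot X_n}\ dt $$
where
$$ \hat \varphi(t) := \frac{1}{(2\pi)^d} \int_{{\bf R}^d} \varphi(x) e^{-it \cdot x}\ dx $$
is a Schwartz function, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have
$$ {\bf E} \varphi(X_n) = \int_{{\bf R}^d} \hat \varphi(t) F_{X_n}(t)\ dt \qquad (8) $$
and similarly for $X$. The claim now follows from the Lebesgue dominated convergence theorem.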
Exercise 10 (Lévy’s continuity theorem, full version) Let $V$ be a finite-dimensional real or complex vector space, and let $X_n$ be a sequence of $V$-valued random variables. Suppose that $F_{X_n}$ converges pointwise to a limit $F$. Show that the following are equivalent:
- (i) $F$ is continuous at $0$.
- (ii) $X_n$ is a tight sequence.
- (iii) $F$ is the characteristic function of a $V$-valued random variable $X$ (possibly after extending the sample space).
- (iv) $X_n$ converges in distribution to some $V$-valued random variable $X$ (possibly after extending the sample space).
Hint: To get from (ii) to the other conclusions, use Prokhorov’s theorem and Theorem 2. To get back to (ii) from (i), use (8) for a suitable Schwartz function $\varphi$. The other implications are easy once Theorem 2 is in hand.
Remark 3 Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.
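In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have
Exercise 12 (Fourier identities) Let $V$ be a finite-dimensional real or complex vector space, and let $X, Y$ be independent random variables taking values in $V$. Then
$$ F_{X+Y}(t) = F_X(t) F_Y(t) \qquad (10) $$
for all $t \in V$. Also, for any scalar $c$, one has
$$ F_{cX}(t) = F_X(\overline{c} t) $$
and more generally, for any linear transformation $T: V \to V$, one has
$$ F_{TX}(t) = F_X(T^* t). $$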
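In particular, in the normalised setting (4), we have the simple relationship
$$ F_{Z_n}(t) = F_X( t / \sqrt{n} )^n \qquad (11) $$
that describes the characteristic function of $Z_n$ in terms of that of $X$.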
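We now have enough machinery to give a quick proof of the central limit theorem:
Proof: (Fourier proof of Theorem 1) We may normalise $X$ to have mean zero and variance $1$. By Exercise 8, we thus have
$$ F_X(t) = 1 - t^2/2 + o(|t|^2) $$
for sufficiently small $t$, or equivalently
$$ F_X(t) = \exp( - t^2/2 + o(|t|^2) ) $$
for sufficiently small $t$. Applying (11), we conclude that
$$ F_{Z_n}(t) \to \exp( - t^2/2 ) $$
as $n \to \infty$ for any fixed $t$. But by Exercise 5, $\exp(-t^2/2)$ is the characteristic function of the normal distribution $N(0,1)_{\bf R}$. The claim now follows from the Lévy continuity theorem.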
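The key limit in this proof can be watched numerically in the one case where everything is explicit. The sketch below assumes $X$ is the signed Bernoulli distribution of Example 1, so that $F_X(t) = \cos(t)$ and (11) gives $F_{Z_n}(t) = \cos(t/\sqrt{n})^n$; the function names and the test values of $t$ are choices made for this illustration only.

```python
import math


def char_fn_normalised_sum(t: float, n: int) -> float:
    """F_{Z_n}(t) = F_X(t / sqrt(n))^n for signed Bernoulli X, where F_X = cos."""
    return math.cos(t / math.sqrt(n)) ** n


def char_fn_gaussian(t: float) -> float:
    """Characteristic function exp(-t^2/2) of N(0,1)."""
    return math.exp(-t * t / 2.0)


def max_gap(n: int) -> float:
    """Largest discrepancy over a few fixed frequencies t."""
    return max(
        abs(char_fn_normalised_sum(t, n) - char_fn_gaussian(t))
        for t in (0.5, 1.0, 2.0)
    )


for n in (10, 100, 10_000):
    print(n, max_gap(n))
```

The gap decays roughly like $1/n$ here rather than $1/\sqrt{n}$, because the third moment of the signed Bernoulli distribution vanishes; a generic distribution would only give the slower rate.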
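Exercise 13 (Vector-valued central limit theorem) Let $\vec X = (X_1,\ldots,X_d)$ be a random variable taking values in ${\bf R}^d$ with finite second moment. Define the covariance matrix $\Sigma(\vec X)$ to be the $d \times d$ matrix $\Sigma$ whose $ij^{th}$ entry is the covariance ${\bf E} (X_i - {\bf E}(X_i)) (X_j - {\bf E}(X_j))$.
- Show that the covariance matrix is positive semi-definite real symmetric.
- Conversely, given any positive definite real symmetric $d \times d$ matrix $\Sigma$ and $\mu \in {\bf R}^d$, show that the normal distribution $N(\mu,\Sigma)_{{\bf R}^d}$, given by the absolutely continuous measure
$$ \frac{1}{((2\pi)^d \det \Sigma)^{1/2}} e^{- (x-\mu) \cdot \Sigma^{-1} (x-\mu) / 2}\ dx, $$
has mean $\mu$ and covariance matrix $\Sigma$, and has a characteristic function given by
$$ F(t) = e^{i \mu \cdot t} e^{- t \cdot \Sigma t / 2}. $$
How would one define the normal distribution $N(\mu,\Sigma)_{{\bf R}^d}$ if $\Sigma$ degenerated to be merely positive semi-definite instead of positive definite?
- If $\vec S_n := \vec X_1 + \ldots + \vec X_n$ is the sum of $n$ iid copies of $\vec X$, show that $\frac{\vec S_n - n \mu}{\sqrt{n}}$ converges in distribution to $N(0,\Sigma(\vec X))_{{\bf R}^d}$, where $\mu := {\bf E} \vec X$.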
Exercise 14 (Complex central limit theorem) Let $X$ be a complex random variable of mean $\mu \in {\bf C}$, whose real and imaginary parts have variance $\sigma^2/2$ and covariance $0$. Let $X_1,\ldots,X_n \equiv X$ be iid copies of $X$. Show that as $n \to \infty$, the normalised sums (1) converge in distribution to the standard complex gaussian $N(0,1)_{\bf C}$.
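Exercise 15 (Lindeberg central limit theorem) Let $X_1, X_2, \ldots$ be a sequence of independent (but not necessarily identically distributed) real random variables, normalised to have mean zero and variance one. Assume the (strong) Lindeberg condition
$$ \lim_{N \to \infty} \limsup_{n \to \infty} \frac{1}{n} \sum_{j=1}^n {\bf E} |X_{j,>N}|^2 = 0 $$
where $X_{j,>N} := X_j {\bf I}(|X_j| > N)$ is the truncation of $X_j$ to large values. Show that as $n \to \infty$, $\frac{X_1 + \ldots + X_n}{\sqrt{n}}$ converges in distribution to $N(0,1)_{\bf R}$. (Hint: modify the truncation argument.)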
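A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.
Theorem 3 (Berry-Esséen theorem) Let $X$ have mean zero, unit variance, and finite third moment. Let $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, where $X_1,\ldots,X_n$ are iid copies of $X$. Then we have
$$ {\bf P}( Z_n < a ) = {\bf P}( G < a ) + O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3) ) \qquad (12) $$
uniformly for all $a \in {\bf R}$, where $G \equiv N(0,1)_{\bf R}$, and the implied constant is absolute.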
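Proof: (Optional) Write $\varepsilon := {\bf E} |X|^3 / \sqrt{n}$; our task is to show that
$$ {\bf P}( Z_n < a ) = {\bf P}( G < a ) + O( \varepsilon ) $$
for all $a$. We may of course assume that $\varepsilon < 1$, as the claim is trivial otherwise.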
Let be a small absolute constant to be chosen later. Let be a non-negative Schwartz function with total mass whose Fourier transform is supported in , and let be the smoothed out version of , defined as
Observe that is decreasing from to .
for every , where the subscript means that the implied constant depends on . Indeed, suppose that (13) held. Define
Let be arbitrary, and let be a large absolute constant to be chosen later. We write
and thus by (13)
Meanwhile, from (14) and an integration by parts we see that
From the bounded density of and the rapid decrease of we have
Putting all this together, we see that
A similar argument gives a lower bound
Taking suprema over , we obtain
If is large enough (depending on ), we can make and small, and thus absorb the latter two terms on the right-hand side into the left-hand side. This gives the desired bound .
for any ; taking expectations and using the definition of we have
and in particular
if and is small enough. Applying (11), we conclude that
if . Meanwhile, from Exercise 5 we have . Elementary calculus then gives us
(say) if is small enough. Inserting this bound into (16) we obtain the claim.
Exercise 16 Show that the error terms here are sharp (up to constants) when $X$ is a signed Bernoulli random variable.
— 3. The moment method —
The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the “standard” proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 12, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity $e^{A+B} = e^A e^B$) and on the independence of the situation (as it uses identities of the form ${\bf E}( e^A e^B ) = ({\bf E} e^A) ({\bf E} e^B)$). When we turn to random matrix theory, we will often lose (or be forced to modify) one or both of these properties, which often causes the Fourier-analytic methods to fail spectacularly. Because of this, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust, and in particular can often be extended to random matrix theory situations. Thus both the Fourier and non-Fourier proofs will be of importance in this course.
The most elementary (but still remarkably effective) method available in this regard is the moment method, which we have already used in the previous notes, and which seeks to understand the distribution of a random variable $X$ via its moments ${\bf E} X^k$. In principle, this method is equivalent to the Fourier method, through the identity (7); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.
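We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases $x \mapsto e^{itx}$ were bounded, the moment functions $x \mapsto x^k$ become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:
Theorem 4 (Carleman continuity theorem) Let $X_n$ be a sequence of uniformly subgaussian real random variables, and let $X$ be another subgaussian random variable. Then the following statements are equivalent:
- (i) For every $k = 0, 1, 2, \ldots$, ${\bf E} X_n^k$ converges pointwise to ${\bf E} X^k$.
- (ii) $X_n$ converges in distribution to $X$.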
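Proof: We first show how (ii) implies (i). Let $N > 0$ be a truncation parameter, and let $\varphi: {\bf R} \to {\bf R}$ be a smooth function that equals $1$ on $[-1,1]$ and vanishes outside of $[-2,2]$. Then for any $k$, the convergence in distribution implies that ${\bf E} X_n^k \varphi(X_n/N)$ converges to ${\bf E} X^k \varphi(X/N)$. On the other hand, from the uniform subgaussian hypothesis, one can make ${\bf E} X_n^k (1 - \varphi(X_n/N))$ and ${\bf E} X^k (1 - \varphi(X/N))$ arbitrarily small for fixed $k$ by making $N$ large enough. Summing, and then letting $N$ go to infinity, we obtain (i).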
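Conversely, suppose (i) is true. From the uniform subgaussian hypothesis, the $X_n$ have $(k+1)^{st}$ moment bounded by $(Ck)^{k/2}$ for all $k \geq 1$ and some $C$ independent of $k$ (see Exercise 4 from Notes 0). From Taylor’s theorem with remainder (and Stirling’s formula, Notes 0a) we conclude
$$ F_{X_n}(t) = \sum_{j=0}^k \frac{(it)^j}{j!} {\bf E} X_n^j + O( (Ck)^{-k/2} |t|^{k+1} ) $$
uniformly in $t$ and $n$. Similarly for $X$. Taking limits using (i) we see that
$$ \limsup_{n \to \infty} |F_{X_n}(t) - F_X(t)| = O( (Ck)^{-k/2} |t|^{k+1} ). $$
Then letting $k \to \infty$, keeping $t$ fixed, we see that $F_{X_n}(t)$ converges pointwise to $F_X(t)$ for each $t$, and the claim now follows from the Lévy continuity theorem.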
Remark 5 One corollary of Theorem 4 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 9 and Remark 2). The situation can fail for distributions with slower tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic.
The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the inverse moment problem and will not be discussed here.
To prove the central limit theorem, we know from the truncation method that we may assume without loss of generality that $X$ is bounded (and in particular subgaussian); we may also normalise $X$ to have mean zero and unit variance. From the Chernoff bound (3) we know that the $Z_n$ are uniformly subgaussian; so by Theorem 4, it suffices to show that
$$ {\bf E} Z_n^k \to {\bf E} G^k $$
for all $k = 0, 1, 2, \ldots$, where $G \equiv N(0,1)_{\bf R}$ is a standard gaussian variable.
The moments ${\bf E} G^k$ are easy to compute:
Exercise 17 Let $k$ be a natural number, and let $G \equiv N(0,1)_{\bf R}$. Show that ${\bf E} G^k$ vanishes when $k$ is odd, and is equal to $\frac{k!}{2^{k/2} (k/2)!}$ when $k$ is even. (This can either be done directly by using the Gamma function, or by using Exercise 5 and Exercise 9.)
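For orientation, the formula in Exercise 17 can be tabulated directly. The short sketch below evaluates $\frac{k!}{2^{k/2} (k/2)!}$ and checks it against the equivalent product $(k-1)(k-3) \cdots 3 \cdot 1$; the helper names are assumptions of this example, and the check is no substitute for the computation the exercise asks for.

```python
import math


def gaussian_moment(k: int) -> int:
    """E G^k for G ~ N(0,1): zero for odd k, k! / (2^(k/2) (k/2)!) for even k."""
    if k % 2 == 1:
        return 0
    half = k // 2
    return math.factorial(k) // (2 ** half * math.factorial(half))


def odd_double_factorial(k: int) -> int:
    """(k-1)(k-3)...3*1 for even k (the empty product, 1, when k = 0)."""
    result = 1
    for j in range(k - 1, 0, -2):
        result *= j
    return result


moments = [gaussian_moment(k) for k in range(9)]
print(moments)  # [1, 0, 1, 0, 3, 0, 15, 0, 105]
```

In particular the fourth moment is $3$, which is the value that reappears in the $k=4$ computation for $Z_n$.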
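So now we need to compute ${\bf E} Z_n^k$. Using (4) and linearity of expectation, we can expand this as
$$ {\bf E} Z_n^k = n^{-k/2} \sum_{1 \leq i_1,\ldots,i_k \leq n} {\bf E} X_{i_1} \ldots X_{i_k}. $$
To understand this expression, let us first look at some small values of $k$.
- For $k=0$, this expression is trivially $1$.
- For $k=1$, this expression is trivially $0$, thanks to the mean zero hypothesis on $X$.
- For $k=2$, we can split this expression into the diagonal and off-diagonal components:
$$ n^{-1} \sum_{1 \leq i \leq n} {\bf E} X_i^2 + n^{-1} \sum_{1 \leq i < j \leq n} {\bf E}\, 2 X_i X_j. $$
Each summand in the first sum is $1$, as $X$ has unit variance. Each summand in the second sum is $0$, as the $X_i$ have mean zero and are independent. So the second moment ${\bf E} Z_n^2$ is $1$.
- For $k=3$, we have a similar expansion
$$ n^{-3/2} \sum_{1 \leq i \leq n} {\bf E} X_i^3 + n^{-3/2} \sum_{1 \leq i < j \leq n} {\bf E} ( 3 X_i^2 X_j + 3 X_i X_j^2 ) + n^{-3/2} \sum_{1 \leq i < j < l \leq n} {\bf E}\, 6 X_i X_j X_l. $$
The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are $O(1)$, so the first term is $O(n^{-1/2})$, which is asymptotically negligible, so the third moment ${\bf E} Z_n^3$ goes to $0$.
- For $k=4$, the expansion becomes quite complicated:
$$ n^{-2} \sum_{1 \leq i \leq n} {\bf E} X_i^4 + n^{-2} \sum_{1 \leq i < j \leq n} {\bf E} ( 4 X_i^3 X_j + 6 X_i^2 X_j^2 + 4 X_i X_j^3 ) $$
$$ + n^{-2} \sum_{1 \leq i < j < l \leq n} {\bf E} ( 12 X_i^2 X_j X_l + 12 X_i X_j^2 X_l + 12 X_i X_j X_l^2 ) + n^{-2} \sum_{1 \leq i < j < l < m \leq n} {\bf E}\, 24 X_i X_j X_l X_m. $$
Again, most terms vanish, except for the first sum, which is $O(n^{-1})$ and is asymptotically negligible, and the sum $n^{-2} \sum_{1 \leq i < j \leq n} {\bf E}\, 6 X_i^2 X_j^2$, which by the independence and unit variance assumptions works out to $6 n^{-2} \binom{n}{2} = 3 + o(1)$. Thus the fourth moment ${\bf E} Z_n^4$ goes to $3$ (as it should).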
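Now we tackle the general case. Ordering the indices $i_1,\ldots,i_k$ as $j_1 < \ldots < j_m$ for some $1 \leq m \leq k$, with each $j_r$ occurring with multiplicity $a_r \geq 1$ and using elementary enumerative combinatorics, we see that ${\bf E} Z_n^k$ is the sum of all terms of the form
$$ n^{-k/2} \sum_{1 \leq j_1 < \ldots < j_m \leq n} c_{k,a_1,\ldots,a_m} {\bf E} X_{j_1}^{a_1} \ldots X_{j_m}^{a_m} \qquad (17) $$
where $1 \leq m \leq k$, $a_1,\ldots,a_m$ are positive integers adding up to $k$, and $c_{k,a_1,\ldots,a_m}$ is the multinomial coefficient
$$ c_{k,a_1,\ldots,a_m} := \frac{k!}{a_1! \ldots a_m!}. $$
The total number of such terms depends only on $k$ (in fact, it is $2^{k-1}$ (exercise!), though we will not need this fact).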
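The count of $2^{k-1}$ terms is the number of ways to write $k$ as an ordered sum of positive integers $a_1 + \ldots + a_m$. A brute-force enumeration, sketched below with a hypothetical helper `compositions`, confirms the count for small $k$ and also shows how few of these terms have every $a_r$ at least $2$.

```python
from typing import Iterator, Tuple


def compositions(k: int) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple (a_1, ..., a_m) of positive integers summing to k."""
    if k == 0:
        yield ()
        return
    for first in range(1, k + 1):
        for rest in compositions(k - first):
            yield (first,) + rest


for k in range(1, 8):
    all_terms = list(compositions(k))
    surviving = [a for a in all_terms if min(a) >= 2]  # no multiplicity equal to 1
    print(k, len(all_terms), 2 ** (k - 1), len(surviving))
```

The enumeration is exponential in $k$, which is harmless here since $k$ is fixed while $n$ goes to infinity.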
As we already saw from the small $k$ examples, most of the terms vanish, and many of the other terms are negligible in the limit $n \to \infty$. Indeed, if any of the $a_r$ are equal to $1$, then every summand in (17) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (17) for which all the $a_r$ are at least $2$. Since the $a_r$ sum up to $k$, we conclude that $m$ is at most $k/2$.
On the other hand, the total number of summands in (17) is clearly at most $n^m$ (in fact it is $\binom{n}{m}$), and the summands are bounded (for fixed $k$) since $X$ is bounded. Thus, if $m$ is strictly less than $k/2$, then the expression in (17) is $O( n^{m-k/2} )$ and goes to zero as $n \to \infty$. So, asymptotically, the only terms (17) which are still relevant are those for which $m$ is equal to $k/2$. This already shows that ${\bf E} Z_n^k$ goes to zero when $k$ is odd. When $k$ is even, the only surviving term in the limit is now when $m = k/2$ and $a_1 = \ldots = a_m = 2$. But then by independence and unit variance, the expectation in (17) is $1$, and so this term is equal to
$$ n^{-k/2} \binom{n}{k/2} c_{k,2,\ldots,2} = \frac{1}{(k/2)!} \frac{k!}{2^{k/2}} + o(1), $$
and the main term is happily equal to the moment ${\bf E} G^k$ as computed in Exercise 17.
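The convergence of moments can be checked exactly in a model case. The sketch below assumes $X$ is signed Bernoulli (values $\pm 1$ with equal probability), so that $S_n = 2B - n$ with $B$ binomial, and computes ${\bf E} Z_n^k$ in exact rational arithmetic; the function names are illustrative.

```python
import math
from fractions import Fraction


def rademacher_sum_moment(n: int, k: int) -> Fraction:
    """Exact E S_n^k for S_n = X_1 + ... + X_n with X_i = +1 or -1 equally likely.

    Uses S_n = 2B - n, where B counts the +1 values and is Binomial(n, 1/2).
    """
    total = sum(math.comb(n, b) * (2 * b - n) ** k for b in range(n + 1))
    return Fraction(total, 2 ** n)


def normalised_moment(n: int, k: int) -> float:
    """E Z_n^k with Z_n = S_n / sqrt(n)."""
    return float(rademacher_sum_moment(n, k)) / n ** (k / 2)


for k in (2, 3, 4, 6):
    print(k, [round(normalised_moment(n, k), 4) for n in (10, 100, 1000)])
```

For $k=4$ the exact value is $3 - 2/n$: the $3$ comes from the $a_1 = a_2 = 2$ terms and the $-2/n$ correction from combining them with the diagonal terms ${\bf E} X_i^4$, matching the discussion of the fourth moment above.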
— 4. The Lindeberg swapping trick —
The moment method proof of the central limit theorem that we just gave consisted of four steps:
- (Truncation and normalisation step) A reduction to the case when $X$ was bounded with zero mean and unit variance.
- (Inverse moment step) A reduction to a computation of asymptotic moments $\lim_{n \to \infty} {\bf E} Z_n^k$.
- (Analysis step) Showing that most terms in the expansion of this asymptotic moment were zero, or went to zero as $n \to \infty$.
- (Algebra step) Using enumerative combinatorics to compute the remaining terms in the expansion.
In this particular case, the enumerative combinatorics was very classical and easy – it was basically asking for the number of ways one can place $k$ balls in $m$ boxes, so that the $r^{th}$ box contains $a_r$ balls, and the answer is well known to be given by the multinomial $\frac{k!}{a_1! \ldots a_m!}$. By a small algebraic miracle, this result matched up nicely with the computation of the moments of the gaussian $N(0,1)_{\bf R}$.
However, when we apply the moment method to more advanced problems, the enumerative combinatorics can become more non-trivial, requiring a fair amount of combinatorial and algebraic computation. The algebraic miracle that occurs at the end of the argument can then seem like a very fortunate but inexplicable coincidence, making the argument somehow unsatisfying despite being rigorous.
In a 1922 paper, Lindeberg observed that there was a very simple way to decouple the algebraic miracle from the analytic computations, so that all relevant algebraic identities only need to be verified in the special case of gaussian random variables, in which everything is much easier to compute. This Lindeberg swapping trick (or Lindeberg replacement trick) will be very useful in the later theory of random matrices, so we pause to give it here in the simple context of the central limit theorem.
The basic idea is as follows. We repeat the truncation-and-normalisation and inverse moment steps in the preceding argument. Thus, $X_1,\ldots,X_n$ are iid copies of a bounded real random variable $X$ of mean zero and unit variance, and we wish to show that ${\bf E} Z_n^k = {\bf E} G^k + o(1)$, where $G \equiv N(0,1)_{\bf R}$ and $k \geq 0$ is fixed.
Now let $Y_1,\ldots,Y_n$ be iid copies of the gaussian itself: $Y_1,\ldots,Y_n \equiv N(0,1)_{\bf R}$. Because the sum of independent gaussians is again a gaussian (Exercise 9 from Notes 1), we see that the random variable
$$ W_n := \frac{Y_1 + \ldots + Y_n}{\sqrt{n}} $$
already has the same distribution as $G$: $W_n \equiv G$. Thus, it suffices to show that
$$ {\bf E} Z_n^k = {\bf E} W_n^k + o(1). $$
Now we perform the analysis part of the moment method argument again. We can expand ${\bf E} Z_n^k$ into terms (17) as before, and discard all terms except for the $a_1 = \ldots = a_m = 2$ term as being $o(1)$. Similarly, we can expand ${\bf E} W_n^k$ into very similar terms (but with the $X_i$ replaced by $Y_i$) and again discard all but the $a_1 = \ldots = a_m = 2$ term.
But by hypothesis, the second moments of $X$ and $Y$ match: ${\bf E} X^2 = {\bf E} Y^2 = 1$. Thus, by joint independence, the $a_1 = \ldots = a_m = 2$ term (17) for $X$ is exactly equal to that of $Y$. And the claim follows.
This is almost exactly the same proof as in the previous section, but note that we did not need to compute the multinomial coefficient $c_{k,a_1,\ldots,a_m}$, nor did we need to verify the miracle that this coefficient matched (up to normalising factors) the moments of the gaussian. Instead, we used the much more mundane “miracle” that the sum of independent gaussians was again a gaussian.
To put it another way, the Lindeberg replacement trick factors a universal limit theorem, such as the central limit theorem, into two components:
- A universality or invariance result, which shows that the distribution (or other statistics, such as moments) of some random variable $F(X_1,\ldots,X_n)$ is asymptotically unchanged in the limit $n \to \infty$ if each of the input variables $X_i$ is replaced by a gaussian substitute $Y_i$; and
- The gaussian case, which computes the asymptotic distribution (or other statistic) of $F(Y_1,\ldots,Y_n)$ in the case when $Y_1,\ldots,Y_n$ are all gaussians.
The former type of result tends to be entirely analytic in nature (basically, one just needs to show that all error terms that show up when swapping $X_i$ with $Y_i$ are $o(1)$), while the latter type of result tends to be entirely algebraic in nature (basically, one just needs to exploit the many pleasant algebraic properties of gaussians). This decoupling of the analysis and algebra steps tends to simplify both, both at a technical level and at a conceptual level.
— 5. Individual swapping —
In the above argument, we swapped all the original input variables with gaussians en masse. There is also a variant of the Lindeberg trick in which the swapping is done individually. To illustrate the individual swapping method, let us use it to show the following weak version of the Berry-Esséen theorem:
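Theorem 5 (Berry-Esséen theorem, weak form) Let $X$ have mean zero, unit variance, and finite third moment, and let $\varphi$ be smooth with uniformly bounded derivatives up to third order. Let $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, where $X_1,\ldots,X_n$ are iid copies of $X$. Then we have
$$ {\bf E} \varphi(Z_n) = {\bf E} \varphi(G) + O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3) \sup_{x \in {\bf R}} |\varphi'''(x)| ) \qquad (18) $$
where $G \equiv N(0,1)_{\bf R}$.
Proof: Let $Y_1,\ldots,Y_n$ and $W_n$ be as in the previous section. As $W_n \equiv G$, it suffices to show that
$$ {\bf E} \varphi(Z_n) - \varphi(W_n) = O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3) \sup_{x \in {\bf R}} |\varphi'''(x)| ). $$
We telescope this (using linearity of expectation) as
$$ {\bf E} \varphi(Z_n) - \varphi(W_n) = - \sum_{i=0}^{n-1} {\bf E} \varphi( Z_{n,i} ) - \varphi( Z_{n,i+1} ) $$
where
$$ Z_{n,i} := \frac{X_1 + \ldots + X_i + Y_{i+1} + \ldots + Y_n}{\sqrt{n}} $$
is a partially swapped version of $Z_n$. So it will suffice to show that
$$ {\bf E} \varphi( Z_{n,i} ) - \varphi( Z_{n,i+1} ) = O( ({\bf E} |X|^3) \sup_{x \in {\bf R}} |\varphi'''(x)| / n^{3/2} ) $$
uniformly for $0 \leq i < n$.
We can write $Z_{n,i} = S_{n,i} + Y_{i+1}/\sqrt{n}$ and $Z_{n,i+1} = S_{n,i} + X_{i+1}/\sqrt{n}$, where
$$ S_{n,i} := \frac{X_1 + \ldots + X_i + Y_{i+2} + \ldots + Y_n}{\sqrt{n}}. \qquad (19) $$
To exploit this, we use Taylor expansion with remainder to write
$$ \varphi(Z_{n,i}) = \varphi(S_{n,i}) + \varphi'(S_{n,i}) Y_{i+1}/\sqrt{n} + \frac{1}{2} \varphi''(S_{n,i}) Y_{i+1}^2 / n + O( |Y_{i+1}|^3 / n^{3/2} \sup_{x \in {\bf R}} |\varphi'''(x)| ) $$
and
$$ \varphi(Z_{n,i+1}) = \varphi(S_{n,i}) + \varphi'(S_{n,i}) X_{i+1}/\sqrt{n} + \frac{1}{2} \varphi''(S_{n,i}) X_{i+1}^2 / n + O( |X_{i+1}|^3 / n^{3/2} \sup_{x \in {\bf R}} |\varphi'''(x)| ) $$
where the implied constants depend on $\varphi$ but not on $n$. Now, by construction, the moments of $X_{i+1}$ and $Y_{i+1}$ match to second order, and both are independent of $S_{n,i}$, thus
$$ {\bf E} \varphi(Z_{n,i}) - \varphi(Z_{n,i+1}) = O( ({\bf E} |Y_{i+1}|^3 + {\bf E} |X_{i+1}|^3) \sup_{x \in {\bf R}} |\varphi'''(x)| / n^{3/2} ), $$
and the claim follows. (Note from Hölder’s inequality that ${\bf E} |X|^3 \geq 1$.)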
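The telescoping structure of this proof can be made concrete in a case where every hybrid expectation has a closed form. The sketch below assumes $X$ is signed Bernoulli and takes the test function $\varphi = \cos$; by independence, ${\bf E} \cos(Z_{n,i})$ factors as $\cos(1/\sqrt{n})^i e^{-(n-i)/(2n)}$. These choices, and the function names, are assumptions of the example rather than part of the argument above.

```python
import math
from typing import List


def hybrid_expectation(n: int, i: int) -> float:
    """E cos(Z_{n,i}): the first i inputs are signed Bernoulli, the other n-i gaussian.

    E exp(i Z_{n,i}) factors by independence into cos(1/sqrt(n))^i * exp(-(n-i)/(2n)),
    which is real and therefore equals E cos(Z_{n,i}).
    """
    return math.cos(1.0 / math.sqrt(n)) ** i * math.exp(-(n - i) / (2.0 * n))


def swap_errors(n: int) -> List[float]:
    """Single-swap differences E phi(Z_{n,i+1}) - E phi(Z_{n,i}) for i = 0, ..., n-1."""
    return [hybrid_expectation(n, i + 1) - hybrid_expectation(n, i) for i in range(n)]


n = 400
errors = swap_errors(n)
total_gap = hybrid_expectation(n, n) - hybrid_expectation(n, 0)  # E cos(Z_n) - E cos(G)
worst_scaled = max(abs(e) for e in errors) * n ** 1.5
print("largest single-swap error times n^(3/2):", worst_scaled)
print("telescoped sum:", sum(errors), "direct gap:", total_gap)
```

Each single swap costs far less than the $O(n^{-3/2})$ allowed by the proof; in this example the third moments of the signed Bernoulli and gaussian inputs also agree (both vanish), so the swap error is in fact of order $n^{-2}$.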
Remark 6 The above argument relied on Taylor expansion, and the hypothesis that the moments of $X$ and $Y$ matched to second order. It is not hard to see that if we assume more moments matching (e.g. ${\bf E} X^3 = {\bf E} Y^3 = 0$), and more smoothness on $\varphi$, then we can improve the $\frac{1}{\sqrt{n}}$ factor on the right-hand side. Thus we expect swapping methods to become more powerful when more moments are matching. We will see this when we discuss the four moment theorem of Van Vu and myself in later lectures, which (very) roughly speaking asserts that the spectral statistics of two random matrices are asymptotically indistinguishable if their coefficients have matching moments to fourth order.
of . For any , one can upper bound this expression by
where is a smooth function equal to on that vanishes outside of , and has third derivative . By Theorem 5, we thus have
On the other hand, as has a bounded probability density function, we have
A very similar argument gives the matching lower bound, thus
Comparing this with Theorem 3 we see that we have lost an exponent of . In our applications to random matrices, this type of loss is acceptable, and so the swapping argument is a reasonable substitute for the Fourier-analytic one in this case. Also, this method is quite robust, and in particular extends well to higher dimensions; we will return to this point in later lectures, but see for instance Appendix D of this paper of myself and Van Vu for an example of a multidimensional Berry-Esséen theorem proven by this method.
On the other hand there is another method that can recover this loss while still avoiding Fourier-analytic techniques; we turn to this topic next.
— 6. Stein’s method —
Stein’s method, introduced by Charles Stein in 1970 (who should not be confused with a number of other eminent mathematicians with this surname, including my advisor), is a powerful method to show convergence in distribution to a special distribution, such as the gaussian. In several recent papers, this method has been used to control several expressions of interest in random matrix theory (e.g. the distribution of moments, or of the Stieltjes transform.) We will not use it much in this course, but this method is of independent interest, so I will briefly discuss (a very special case of) it here.
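The probability density function $\rho(x) := \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ of the standard normal distribution $N(0,1)_{\bf R}$ can be viewed as a solution to the ordinary differential equation
$$ \rho'(x) + x \rho(x) = 0. \qquad (21) $$
One can take adjoints of this, and conclude (after an integration by parts) that $\rho$ obeys the integral identity
$$ \int_{\bf R} \rho(x) ( f'(x) - x f(x) )\ dx = 0 $$
for any continuously differentiable $f$ with both $f$ and $f'$ bounded. To put it another way, if $G \equiv N(0,1)_{\bf R}$, then we have
$$ {\bf E} f'(G) - G f(G) = 0 \qquad (22) $$
whenever $f$ is continuously differentiable with $f, f'$ both bounded.
It turns out that the converse is true: if $X$ is a real random variable with the property that
$$ {\bf E} f'(X) - X f(X) = 0 $$
whenever $f$ is continuously differentiable with $f, f'$ both bounded, then $X$ is gaussian. In fact, more is true, in the spirit of Theorem 2 and Theorem 4:
Theorem 6 (Stein continuity theorem) Let $X_n$ be a sequence of real random variables with uniformly bounded second moment, and let $G \equiv N(0,1)_{\bf R}$. Then the following are equivalent:
- (i) ${\bf E} f'(X_n) - X_n f(X_n)$ converges to zero whenever $f: {\bf R} \to {\bf R}$ is continuously differentiable with $f, f'$ both bounded.
- (ii) $X_n$ converges in distribution to $G$.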
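Proof: To show that (ii) implies (i), it is not difficult to use the uniform bounded second moment hypothesis and a truncation argument to show that ${\bf E} f'(X_n) - X_n f(X_n)$ converges to ${\bf E} f'(G) - G f(G)$ when $f$ is continuously differentiable with $f, f'$ both bounded, and the claim then follows from (22).
Now we establish the converse. It suffices to show that
$$ {\bf E} \varphi(X_n) - {\bf E} \varphi(G) \to 0 $$
whenever $\varphi: {\bf R} \to {\bf R}$ is a bounded Lipschitz function. We may normalise $\varphi$ to be bounded in magnitude by $1$.
Trivially, the function $\varphi(\cdot) - {\bf E} \varphi(G)$ has zero expectation when one substitutes $G$ for the argument $\cdot$, thus
$$ \frac{1}{\sqrt{2\pi}} \int_{\bf R} e^{-y^2/2} ( \varphi(y) - {\bf E} \varphi(G) )\ dy = 0. \qquad (23) $$
Comparing this with (22), one may thus hope to find a representation of the form
$$ \varphi(x) - {\bf E} \varphi(G) = f'(x) - x f(x) \qquad (24) $$
for some continuously differentiable $f$ with $f, f'$ both bounded. This is a simple ODE and can be easily solved (by the method of integrating factors) to give a solution $f$, namely
$$ f(x) := e^{x^2/2} \int_{-\infty}^x e^{-y^2/2} ( \varphi(y) - {\bf E} \varphi(G) )\ dy. \qquad (25) $$
(One could dub $f$ the Stein transform of $\varphi$, although this term does not seem to be in widespread use.) By the fundamental theorem of calculus, $f$ is continuously differentiable and solves (24). Using (23), we may also write $f$ as
$$ f(x) = - e^{x^2/2} \int_x^\infty e^{-y^2/2} ( \varphi(y) - {\bf E} \varphi(G) )\ dy. \qquad (26) $$
By completing the square, we see that $e^{-y^2/2} \leq e^{-x^2/2} e^{-x(y-x)}$. Inserting this into (25) and using the bounded nature of $\varphi$, we conclude that $f(x) = O_\varphi(1/|x|)$ for $x < -1$; inserting it instead into (26), we have $f(x) = O_\varphi(1/|x|)$ for $x > 1$. Finally, easy estimates give $f(x) = O_\varphi(1)$ for $|x| \leq 1$. Thus for all $x$ we have
$$ f(x) = O_\varphi( \frac{1}{1+|x|} ) $$
which when inserted back into (24) gives the boundedness of $f'$ (and also of course gives the boundedness of $f$). In fact, if we rewrite (26) as
$$ f(x) = - \int_0^\infty e^{-s^2/2} e^{-sx} ( \varphi(x+s) - {\bf E} \varphi(G) )\ ds, $$
we see on differentiation under the integral sign (and using the Lipschitz nature of $\varphi$) that $f'(x) = O_\varphi(1/x)$ for $x > 1$; a similar manipulation (starting from (25)) applies for $x < -1$, and we in fact conclude that $f'(x) = O_\varphi( \frac{1}{1+|x|} )$ for all $x$.
Applying (24) with $x = X_n$ and taking expectations, we have
$$ {\bf E} \varphi(X_n) - {\bf E} \varphi(G) = {\bf E} f'(X_n) - X_n f(X_n). $$
By the hypothesis (i), the right-hand side goes to zero, hence the left-hand side does also, and the claim follows.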
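The identity (22) that drives this theorem, namely that ${\bf E} f'(G) - G f(G)$ vanishes for gaussian $G$, is easy to test numerically. The sketch below approximates the gaussian expectation with a trapezoid rule on a truncated interval; the quadrature parameters, the two test functions and the helper names are assumptions of this example.

```python
import math
from typing import Callable


def gaussian_expectation(
    g: Callable[[float], float], lo: float = -10.0, hi: float = 10.0, steps: int = 20_000
) -> float:
    """Approximate E g(G) for G ~ N(0,1) by the trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * g(x) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)


def stein_defect(f: Callable[[float], float], f_prime: Callable[[float], float]) -> float:
    """E[f'(G) - G f(G)], which should vanish for gaussian G."""
    return gaussian_expectation(lambda x: f_prime(x) - x * f(x))


defect_sin = stein_defect(math.sin, math.cos)
defect_tanh = stein_defect(math.tanh, lambda x: 1.0 / math.cosh(x) ** 2)
# For comparison, a signed Bernoulli X (values +1, -1) with f = sin gives
# E[f'(X) - X f(X)] = cos(1) - sin(1), which is visibly non-zero.
rademacher_defect = math.cos(1.0) - math.sin(1.0)
print(defect_sin, defect_tanh, rademacher_defect)
```

The non-zero value for the signed Bernoulli variable is a small instance of the converse direction: a failure of the identity for a single admissible $f$ already certifies that the distribution is not gaussian.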
The above theorem gave only a qualitative result (convergence in distribution), but the proof is quite quantitative, and can be used in particular to give Berry-Esséen type results. To illustrate this, we begin with a strengthening of Theorem 5 that reduces the number of derivatives of $\varphi$ that need to be controlled:
Theorem 7 (Berry-Esséen theorem, less weak form) Let $X$ have mean zero, unit variance, and finite third moment, and let $\varphi$ be smooth, bounded in magnitude by $1$, and Lipschitz. Let $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, where $X_1,\ldots,X_n$ are iid copies of $X$. Then we have
$$ {\bf E} \varphi(Z_n) = {\bf E} \varphi(G) + O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3) ( 1 + \sup_{x \in {\bf R}} |\varphi'(x)| ) ) \qquad (27) $$
where $G \equiv N(0,1)_{\bf R}$.
Proof: Set .
We expand . For each , we then split , where (cf. (19)). By the fundamental theorem of calculus, we have
so we may rewrite (28) as
Recall from the proof of Theorem 6 that and . By the product rule, this implies that has a Lipschitz constant of . Applying (24) and the definition of , we conclude that has a Lipschitz constant of . Thus we can bound the previous expression as
and the claim follows from Hölder’s inequality.
This improvement already reduces the loss in (20) to . But one can do better still by pushing the arguments further. Let us illustrate this in the model case when the not only have bounded third moment, but are in fact bounded:
Proof: Write , thus we seek to show that
Let be the Stein transform (25) of . is not continuous, but it is not difficult to see (e.g. by a limiting argument) that we still have the estimates and (in a weak sense), and that has a Lipschitz norm of (here we use the hypothesis ). A similar limiting argument gives
and by arguing as in the proof of Theorem 7, we can write the right-hand side as
From (24), is equal to , plus a function with Lipschitz norm . Thus, we can write the above expression as
The terms cancel (due to the independence of and , and the normalised mean and variance of ), so we can simplify this as
and so we conclude that
Since and are bounded, and is non-increasing, we have
applying the second inequality and using independence to once again eliminate the factor, we see that
which implies (by another appeal to the non-increasing nature of and the bounded nature of ) that
or in other words that
Similarly, using the lower bound inequalities, one has
Moving up and down by , and using the bounded density of , we obtain the claim.
Actually, one can use Stein’s method to obtain the full Berry-Esséen theorem, but the computations get somewhat technical, requiring an induction on $n$ to deal with the contribution of the exceptionally large values of $X_i$: see this paper of Barbour and Hall.
— 7. Predecessor comparison —
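Suppose one had never heard of the normal distribution, but one still suspected the existence of the central limit theorem – thus, one thought that the sequence $Z_n$ of normalised distributions was converging in distribution to something, but was unsure what the limit was. Could one still work out what that limit was?
Certainly in the case of Bernoulli distributions, one could work explicitly using Stirling’s formula (see Exercise 2), and the Fourier-analytic method would also eventually work. Let us now give a third way to (heuristically) derive the normal distribution as the limit of the central limit theorem. The idea is to compare $Z_n$ with its predecessor $Z_{n-1}$, using the recursive formula
$$ Z_n = \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} + \frac{1}{\sqrt{n}} X_n \qquad (30) $$
(normalising $X_n$ to have mean zero and unit variance as usual; let us also truncate $X_n$ to be bounded, for simplicity). Let us hypothesise that $Z_n$ and $Z_{n-1}$ are approximately the same distribution; let us also conjecture that this distribution is absolutely continuous, given as $\rho(x)\ dx$ for some smooth $\rho(x)$. (If we secretly knew the central limit theorem, we would know that $\rho(x)$ is in fact $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, but let us pretend that we did not yet know this fact.) Thus, for any test function $\varphi$, we expect
$$ {\bf E} \varphi(Z_n) \approx {\bf E} \varphi(Z_{n-1}) \approx \int_{\bf R} \varphi(x) \rho(x)\ dx. \qquad (31) $$
Now let us try to combine this with (30). We assume $\varphi$ to be smooth, and Taylor expand to third order:
$$ \varphi(Z_n) = \varphi( \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} ) + \frac{1}{\sqrt{n}} X_n \varphi'( \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} ) + \frac{1}{2n} X_n^2 \varphi''( \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} ) + O( \frac{1}{n^{3/2}} ). $$
Taking expectations, and using the independence of $X_n$ and $Z_{n-1}$, together with the normalisations on $X_n$, we obtain
$$ {\bf E} \varphi(Z_n) = {\bf E} \varphi( \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} ) + \frac{1}{2n} {\bf E} \varphi''( \frac{\sqrt{n-1}}{\sqrt{n}} Z_{n-1} ) + O( \frac{1}{n^{3/2}} ). $$
Up to errors of $O( \frac{1}{n^{3/2}} )$, one can approximate the second term here by $\frac{1}{2n} {\bf E} \varphi''( Z_{n-1} )$. We then insert (31) and are led to the heuristic equation
$$ \int_{\bf R} \varphi(x) \rho(x)\ dx \approx \int_{\bf R} \varphi( \frac{\sqrt{n-1}}{\sqrt{n}} x ) \rho(x) + \frac{1}{2n} \varphi''(x) \rho(x)\ dx + O( \frac{1}{n^{3/2}} ). $$
Changing variables for the first term on the right hand side, and integrating by parts for the second term, we have
$$ \int_{\bf R} \varphi(x) \rho(x)\ dx \approx \int_{\bf R} \varphi(x) \frac{\sqrt{n}}{\sqrt{n-1}} \rho( \frac{\sqrt{n}}{\sqrt{n-1}} x ) + \frac{1}{2n} \varphi(x) \rho''(x)\ dx + O( \frac{1}{n^{3/2}} ). $$
Since $\varphi$ was an arbitrary test function, this suggests the heuristic equation
$$ \rho(x) \approx \frac{\sqrt{n}}{\sqrt{n-1}} \rho( \frac{\sqrt{n}}{\sqrt{n-1}} x ) + \frac{1}{2n} \rho''(x) + O( \frac{1}{n^{3/2}} ). $$
Taylor expansion gives
$$ \frac{\sqrt{n}}{\sqrt{n-1}} \rho( \frac{\sqrt{n}}{\sqrt{n-1}} x ) = \rho(x) + \frac{1}{2n} \rho(x) + \frac{1}{2n} x \rho'(x) + O( \frac{1}{n^{3/2}} ) $$
which leads us to the heuristic ODE
$$ L \rho(x) = 0 $$
where $L$ is the Ornstein-Uhlenbeck operator
$$ L \rho(x) := \rho(x) + x \rho'(x) + \rho''(x). $$
Observe that $L \rho$ is the total derivative of $x \rho(x) + \rho'(x)$; integrating from infinity, we thus get
$$ x \rho(x) + \rho'(x) = 0 $$
which is (21), and can be solved by standard ODE methods as $\rho(x) = c e^{-x^2/2}$ for some $c$; the requirement that probability density functions have total mass $1$ then gives the constant $c$ as $\frac{1}{\sqrt{2\pi}}$, as we knew it must.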
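The conclusion of this heuristic, that the gaussian density is annihilated by $L \rho = \rho + x \rho' + \rho''$, can be checked with finite differences. In the sketch below the step size, the grid of test points and the comparison with a Cauchy density are choices made for illustration only.

```python
import math
from typing import Callable


def gaussian_density(x: float) -> float:
    """Standard gaussian density exp(-x^2/2) / sqrt(2 pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def cauchy_density(x: float) -> float:
    """A non-gaussian density, used only for comparison."""
    return 1.0 / (math.pi * (1.0 + x * x))


def ou_operator(density: Callable[[float], float], x: float, h: float = 1e-3) -> float:
    """Central-difference approximation of L rho(x) = rho(x) + x rho'(x) + rho''(x)."""
    d1 = (density(x + h) - density(x - h)) / (2.0 * h)
    d2 = (density(x + h) - 2.0 * density(x) + density(x - h)) / (h * h)
    return density(x) + x * d1 + d2


grid = [j / 4.0 for j in range(-12, 13)]  # test points in [-3, 3]
gaussian_residual = max(abs(ou_operator(gaussian_density, x)) for x in grid)
cauchy_at_zero = ou_operator(cauchy_density, 0.0)
print("max |L rho| for the gaussian:", gaussian_residual)
print("L rho at 0 for the Cauchy density:", cauchy_at_zero)
```

The gaussian residual is zero up to discretisation error, while the Cauchy density gives approximately $-1/\pi$ at the origin, so it is not a steady state of this operator.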
The above argument was not rigorous, but one can make it so with a significant amount of PDE machinery. If we view $n$ (or more precisely, $\log n$) as a time parameter, and view $\rho$ as depending on time, the above computations heuristically lead us eventually to the Fokker-Planck equation for the Ornstein-Uhlenbeck process,
$$ \partial_t \rho(t,x) = L \rho $$
which is a linear parabolic equation that is fortunate enough that it can be solved exactly (indeed, it is not difficult to transform this equation to the linear heat equation by some straightforward changes of variable). Using the spectral theory of the Ornstein-Uhlenbeck operator $L$, one can show that solutions to this equation starting from an arbitrary probability distribution are attracted to the gaussian density function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, which as we saw is the steady state for this equation. The stable nature of this attraction can eventually be used to make the above heuristic analysis rigorous. However, this requires a substantial amount of technical effort (e.g. developing the theory of Sobolev spaces associated to $L$) and will not be attempted here. One can also proceed by relating the Fokker-Planck equation to the associated stochastic process, namely the Ornstein-Uhlenbeck process, but this requires one to first set up stochastic calculus, which we will not do here. (The various Taylor expansion calculations we have performed in these notes, though, are closely related to stochastic calculus tools such as Ito’s lemma.) Stein’s method, discussed above, can also be interpreted as a way of making the above computations rigorous (by not working with the density function $\rho$ directly, but instead testing the random variable $Z_n$ against various test functions $\varphi$).
This argument does, though, highlight two ideas which we will see again in later notes when studying random matrices. Firstly, it is profitable to study the distribution of some random object $Z_n$ by comparing it with its predecessor $Z_{n-1}$, which one presumes to have almost the same distribution. Secondly, it may be helpful to approximate (in some weak sense) a discrete process (such as the iteration of the scheme (30)) with a continuous evolution (in this case, a Fokker-Planck equation) which can then be controlled using PDE methods.