You are currently browsing the monthly archive for March 2010.

(Linear) Fourier analysis can be viewed as a tool to study an arbitrary function {f} on (say) the integers {{\bf Z}}, by looking at how such a function correlates with linear phases such as {n \mapsto e(\xi n)}, where {e(x) := e^{2\pi i x}} is the fundamental character, and {\xi \in {\bf R}} is a frequency. These correlations control a number of expressions relating to {f}, such as the expected behaviour of {f} on arithmetic progressions {n, n+r, n+2r} of length three.

In this course we will be studying higher-order correlations, such as the correlation of {f} with quadratic phases such as {n \mapsto e(\xi n^2)}, as these will control the expected behaviour of {f} on more complex patterns, such as arithmetic progressions {n, n+r, n+2r, n+3r} of length four. In order to do this, we must first understand the behaviour of exponential sums such as

\displaystyle \sum_{n=1}^N e( \alpha n^2 ).

Such sums are closely related to the distribution of expressions such as {\alpha n^2 \hbox{ mod } 1} in the unit circle {{\bf T} := {\bf R}/{\bf Z}}, as {n} varies from {1} to {N}. More generally, one is interested in the distribution of polynomials {P: {\bf Z}^d \rightarrow {\bf T}} of one or more variables taking values in a torus {{\bf T}}; for instance, one might be interested in the distribution of the quadruplet {(\alpha n^2, \alpha (n+r)^2, \alpha(n+2r)^2, \alpha(n+3r)^2)} as {n,r} both vary from {1} to {N}. Roughly speaking, once we understand these types of distributions, then the general machinery of quadratic Fourier analysis will then allow us to understand the distribution of the quadruplet {(f(n), f(n+r), f(n+2r), f(n+3r))} for more general classes of functions {f}; this can lead for instance to an understanding of the distribution of arithmetic progressions of length {4} in the primes, if {f} is somehow related to the primes.
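As a rough numerical illustration of the exponential sums above, the following sketch evaluates {\sum_{n=1}^N e(\alpha n^2)} directly. The function names `e` and `quadratic_sum` and the sample values of {\alpha} are illustrative choices, not notation taken from the course.

```python
import cmath
import math


def e(x):
    """Fundamental character e(x) = exp(2*pi*i*x)."""
    return cmath.exp(2j * math.pi * x)


def quadratic_sum(alpha, N):
    """Return sum_{n=1}^{N} e(alpha * n^2), computed term by term."""
    return sum(e(alpha * n * n) for n in range(1, N + 1))


if __name__ == "__main__":
    N = 2000
    samples = [("alpha = 0", 0.0),
               ("alpha = 1/4", 0.25),
               ("alpha = sqrt(2)", math.sqrt(2))]
    for label, alpha in samples:
        # Normalised size |S|/N: close to 1 means no cancellation at all.
        print(label, abs(quadratic_sum(alpha, N)) / N)
```

One case that can be checked by hand: when {\alpha = 1/q} for an odd prime {q} and {N = q}, the sum is a classical quadratic Gauss sum, whose absolute value is {\sqrt{q}}.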

More generally, to find arithmetic progressions such as {n,n+r,n+2r,n+3r} in a set {A}, it would suffice to understand the equidistribution of the quadruplet {(1_A(n), 1_A(n+r), 1_A(n+2r), 1_A(n+3r))} in {\{0,1\}^4} as {n} and {r} vary. This is the starting point for the fundamental connection between combinatorics (and more specifically, the task of finding patterns inside sets) and dynamics (and more specifically, the theory of equidistribution and recurrence in measure-preserving dynamical systems, which is a subfield of ergodic theory). This connection was explored in one of my previous classes; it will also be important in this course (particularly as a source of motivation), but the primary focus will be on finitary, and Fourier-based, methods.

The theory of equidistribution of polynomial orbits was developed in the linear case by Dirichlet and Kronecker, and in the polynomial case by Weyl. There are two regimes of interest; the (qualitative) asymptotic regime in which the scale parameter {N} is sent to infinity, and the (quantitative) single-scale regime in which {N} is kept fixed (but large). Traditionally, it is the asymptotic regime which is studied, which connects the subject to other asymptotic fields of mathematics, such as dynamical systems and ergodic theory. However, for many applications (such as the study of the primes), it is the single-scale regime which is of greater importance. The two regimes are not directly equivalent, but are closely related: the single-scale theory can usually be used to derive analogous results in the asymptotic regime, and conversely the arguments in the asymptotic regime can serve as a simplified model to show the way to proceed in the single-scale regime. The analogy between the two can be made tighter by introducing the (qualitative) ultralimit regime, which is formally equivalent to the single-scale regime (except for the fact that explicitly quantitative bounds are abandoned in the ultralimit), but resembles the asymptotic regime quite closely.

We will view the equidistribution theory of polynomial orbits as a special case of Ratner’s theorem, which we will study in more generality later in this course.

For the finitary portion of the course, we will be using asymptotic notation: {X \ll Y}, {Y \gg X}, or {X = O(Y)} denotes the bound {|X| \leq CY} for some absolute constant {C}, and if we need {C} to depend on additional parameters then we will indicate this by subscripts, e.g. {X \ll_d Y} means that {|X| \leq C_d Y} for some {C_d} depending only on {d}. In the ultralimit theory we will use an analogue of asymptotic notation, which we will review later in these notes.


Assaf Naor and I have just uploaded to the arXiv our joint paper “Scale-oblivious metric fragmentation and the nonlinear Dvoretzky theorem“.

Consider a finite metric space {(X,d)}, with {X} being a set of {n} points. It is not always the case that this space can be isometrically embedded into a Hilbert space such as {\ell^2}; for instance, in a Hilbert space every pair of points has a unique midpoint, but uniqueness can certainly fail for finite metric spaces (consider for instance the graph metric on a four-point diamond). The situation improves, however, if one allows (a) the embedding to have some distortion, and (b) one is willing to embed just a subset of {X}, rather than all of {X}. More precisely, for any integer {n} and any distortion {D \geq 1}, let {k(n,D)} be the largest integer with the property that given every {n}-point metric space {(X,d)}, there exists a {k}-point subset {Y} of {X}, and a map {f: Y \rightarrow \ell^2}, which has distortion at most {D} in the sense that

\displaystyle  d(y,y') \leq \| f(y) - f(y') \|_{\ell^2} \leq D d(y,y')

for all {y,y' \in Y}.

Bourgain, Figiel, and Milman established that for any fixed {D>1}, one has the lower bound {k(n,D) \gg_D \log n}, and also showed for {D} sufficiently close to {1} there was a matching upper bound {k(n,D) \ll_D \log n}. This type of result was called a nonlinear Dvoretzky theorem, in analogy with the linear Dvoretzky theorem that asserts that given any {n}-dimensional normed vector space and {D > 1}, there exists a {k}-dimensional subspace which embeds with distortion {D} into {\ell^2}, with {k \sim_D \log n}. In other words, one can ensure that about {\log n} of the {n} points in an arbitrary metric space behave in an essentially Euclidean manner, and that this is best possible if one wants the distortion to be small.

Bartal, Linial, Mendel, and Naor observed that there was a threshold phenomenon at {D=2}. Namely, for {D < 2}, the Bourgain-Figiel-Milman bounds were sharp with {k(n,D) \sim_D \log n}; but for {D>2} one instead had a power law

\displaystyle  n^{1-B(D)} \ll_D k(n,D) \ll n^{1-b(D)}

for some {0 < b(D) \leq B(D) < 1}. In other words, once one allows distortion by factors greater than {2}, one can now embed a polynomial portion of the points, rather than just a logarithmic portion, into a Hilbert space. This has some applications in theoretical computer science to constructing “approximate distance oracles” for high-dimensional sets of data. (The situation at the critical value {D=2} is still unknown.)

In the special case that the metric {d} is an ultrametric, so that the triangle inequality {d(x,z) \leq d(x,y) + d(y,z)} is upgraded to the ultra-triangle inequality {d(x,z) \leq \max(d(x,y), d(y,z))}, then it is an easy exercise to show that {X} can be embedded isometrically into a Hilbert space, and in fact into a sphere of radius {\hbox{diam}(X)/\sqrt{2}}. Indeed, this can be established by an induction on the cardinality of {X}, using the ultrametric inequality to partition any finite ultrametric space {X} of two or more points into sets of strictly smaller diameter that are all separated by {\hbox{diam}(X)}, and using the inductive hypothesis and Pythagoras’ theorem.

One can then replace the concept of embedding into a Hilbert space, with the apparently stronger concept of embedding into an ultrametric space; this is useful for the computer science applications as ultrametrics have a tree structure which allows for some efficient algorithms for computing distances in such spaces. As it turns out, all the preceding constructions carry over without difficulty to this setting; thus, for {D<2}, one can embed a logarithmic number of points with distortion {D} into an ultrametric space, and for {D>2} one can embed a polynomial number of points.

One can view the task of locating a subset of a metric space that is equivalent (up to bounded distortion) to an ultrametric as that of fragmenting a metric space into a tree-like structure. For instance, the standard metric on the arithmetic progression {\{1,\ldots,3^m\}} of length {n=3^m} is not an ultrametric (and in fact needs a huge distortion factor of {3^m} in order to embed into an ultrametric space), but if one restricts to the Cantor-like subset of integers in {\{1,\ldots,3^m\}} whose base {3} expansion consists solely of {0}s and {1}s, then one easily verifies that the resulting set is fragmented enough that the standard metric is equivalent (up to a distortion factor of {3}) to an ultrametric, namely the metric {d(x,y) := 3^{j}/2} for {x \neq y}, where {j} is the largest integer for which the {3^j} digits in the base {3} expansions of {x} and {y} differ.
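The Cantor-like example can be checked by brute force for small {m}. The sketch below makes two assumptions for convenience: it works with the integers in {\{0,\ldots,3^m-1\}} whose base {3} digits are {0} or {1} (a shift of the set in the text), and it takes {j} to be the highest digit position at which {x} and {y} differ. It stores {2d(x,y) = 3^j} so that all arithmetic stays in the integers; the function names are hypothetical.

```python
from itertools import product


def cantor_like_points(m):
    """Integers below 3**m whose base-3 digits are all 0 or 1."""
    return [sum(d * 3 ** i for i, d in enumerate(digits))
            for digits in product((0, 1), repeat=m)]


def top_differing_digit(x, y):
    """Largest j such that the 3**j digits of x and y differ (needs x != y)."""
    j, top = 0, None
    while x or y:
        if x % 3 != y % 3:
            top = j
        x, y, j = x // 3, y // 3, j + 1
    return top


def doubled_ultrametric(x, y):
    """Return 2*d(x, y) = 3**j as an exact integer (0 when x == y)."""
    return 0 if x == y else 3 ** top_differing_digit(x, y)


def check(m):
    """Verify d <= |x - y| <= 3d and the ultra-triangle inequality."""
    pts = cantor_like_points(m)
    for x in pts:
        for y in pts:
            if x == y:
                continue
            d2 = doubled_ultrametric(x, y)
            # d <= |x - y| <= 3d, multiplied through by 2
            assert d2 <= 2 * abs(x - y) <= 3 * d2
            for z in pts:
                assert doubled_ultrametric(x, z) <= max(d2, doubled_ultrametric(y, z))
    return True
```

Calling `check(5)` runs over all triples among the 32 points below {3^5}; it is only a sanity check for small cases, not a proof.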

The above fragmentation constructions were somewhat complicated and deterministic in nature. Mendel and Naor introduced a simpler probabilistic method, based on random partition trees of {X}, which reproved the above results in the high-distortion case {D \gg 1} (with {b(D), B(D) \sim 1/D} in the limit {D \rightarrow \infty}). However, this method had inherent limitations in the low-distortion case, in particular failing for {D < 3} (basically because the method would establish a stronger embedding result which fails in that regime).

In this paper, we introduce a variant of the random partition tree method, which involves a random fragmentation at a randomly chosen set of scales, that works all the way down to {D>2} and gives a clean value for the exponent {B(D)} in the lower bound, namely {2e/D}; in fact we have the more precise result that we can take {B(D)} to be {1-\theta}, where {0 < \theta < 1} is the unique solution to the equation

\displaystyle  \frac{2}{D} = (1-\theta) \theta^{\theta/(1-\theta)},

and that this is the limit of the method if one chooses a “scale-oblivious” approach that applies uniformly to all metric spaces, ignoring the internal structure of each individual metric space.
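The equation for {\theta} has no closed-form solution, but it is easy to solve numerically. The sketch below uses bisection, relying on the fact that the right-hand side {(1-\theta)\theta^{\theta/(1-\theta)}} decreases from {1} to {0} as {\theta} goes from {0} to {1} (its logarithmic derivative is {\log \theta/(1-\theta)^2 < 0}). The names `g` and `solve_theta` and the tolerance are illustrative assumptions, not taken from the paper.

```python
import math


def g(theta):
    """Right-hand side (1 - theta) * theta**(theta / (1 - theta)), 0 < theta < 1."""
    return (1.0 - theta) * theta ** (theta / (1.0 - theta))


def solve_theta(D, tol=1e-12):
    """Bisection for the solution of g(theta) = 2/D in (0, 1); needs D > 2."""
    if D <= 2:
        raise ValueError("a solution in (0, 1) exists only for D > 2")
    target = 2.0 / D
    lo, hi = 0.0, 1.0          # g -> 1 as theta -> 0, g -> 0 as theta -> 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid           # g is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for D in (2.5, 4.0, 10.0, 100.0, 1000.0):
        theta = solve_theta(D)
        # Compare the exact exponent 1 - theta with the approximation 2e/D.
        print(D, theta, 1.0 - theta, 2.0 * math.e / D)
```

For large {D} the printed values of {1-\theta} and {2e/D} should agree to leading order, which is where the clean value {2e/D} comes from.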

The construction ends up being relatively simple (taking about three pages). The basic idea is as follows. Pick two scales {R > r > 0}, and let {x_1, x_2, x_3, \ldots} be a random sequence of points in {X} (of course, since {X} is finite, every point in {X} will almost surely be visited infinitely often by this sequence). We can then partition {X} into pieces {Y_1, Y_2, \ldots}, by defining {Y_m} to be the set of points {x} for which {x_m} is the first point in the sequence to lie in the ball {B(x,R)}. We then let {S_m} be the subset of {Y_m} in which {x_m} actually falls in the smaller ball {B(x,r)}. As a consequence, we see that each {S_m} is contained in a ball of radius {r} (namely {B(x_m,r)}), but that any two {S_m, S_{m'}} are separated by a distance of at least {R-r} (from the triangle inequality). This gives one layer of a quasi-ultrametric tree structure on {\bigcup_m S_m}; if one iterates this over many different pairs of scales {R, r}, one gets a full quasi-ultrametric tree structure, which one can then adjust with bounded distortion to a genuine ultrametric structure. The game is then to optimise the choice of {R, r} so as to maximise the portion of {X} that remains in the tree; it turns out that a suitably random choice of such scales is the optimal one.
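A single layer of this fragmentation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure as written in the paper: the function name `fragment_once` is hypothetical, the scales {R, r} are fixed inputs rather than randomly chosen, and the infinite random sequence is replaced by a random permutation of {X} (only the order of first visits matters, and for an iid uniform sequence that order is a uniformly random permutation).

```python
import math
import random


def fragment_once(points, dist, R, r, rng):
    """One fragmentation layer at scales R > r > 0.

    Returns a list of (centre, S) index pairs, where S holds the points
    whose first-arriving centre within distance R is within distance r.
    """
    order = list(range(len(points)))
    rng.shuffle(order)             # order of first visits x_1, x_2, ...
    unassigned = set(order)
    layers = []
    for c in order:
        # Y: unclaimed points x for which centre c lies in the open ball B(x, R)
        Y = [i for i in unassigned if dist(points[i], points[c]) < R]
        # S: the part of Y for which c even lies in the smaller ball B(x, r)
        S = [i for i in Y if dist(points[i], points[c]) < r]
        unassigned.difference_update(Y)
        if S:
            layers.append((c, S))
    return layers


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    layers = fragment_once(pts, math.dist, 0.3, 0.1, rng)
    kept = sum(len(S) for _, S in layers)
    print("pieces:", len(layers), "points kept:", kept, "of", len(pts))
```

Points of {Y_m} that are not in {S_m} are discarded at this layer, which is why the choice of scales governs how much of {X} survives.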

The standard modern foundation of mathematics is constructed using set theory. With these foundations, the mathematical universe of objects one studies contains not only the “primitive” mathematical objects such as numbers and points, but also sets of these objects, sets of sets of objects, and so forth. (In a pure set theory, the primitive objects would themselves be sets as well; this is useful for studying the foundations of mathematics, but for most mathematical purposes it is more convenient, and less conceptually confusing, to refrain from modeling primitive objects as sets.) One has to carefully impose a suitable collection of axioms on these sets, in order to avoid paradoxes such as Russell’s paradox; but with a standard axiom system such as Zermelo-Fraenkel-Choice (ZFC), all actual paradoxes that we know of are eliminated. Still, one might be somewhat unnerved by the presence in set theory of statements which, while not genuinely paradoxical in a strict sense, are still highly unintuitive; Cantor’s theorem on the uncountability of the reals, and the Banach-Tarski paradox, are perhaps the two most familiar examples of this.

One may suspect that the reason for this unintuitive behaviour is the presence of infinite sets in one’s mathematical universe. After all, if one deals solely with finite sets, then there is no need to distinguish between countable and uncountable infinities, and Banach-Tarski type paradoxes cannot occur.

On the other hand, many statements in infinitary mathematics can be reformulated into equivalent statements in finitary mathematics (involving only finitely many points or numbers, etc.); I have explored this theme in a number of previous blog posts. So, one may ask: what is the finitary analogue of statements such as Cantor’s theorem or the Banach-Tarski paradox?

The finitary analogue of Cantor’s theorem is well-known: it is the assertion that {2^n > n} for every natural number {n}, or equivalently that the power set of a finite set {A} of {n} elements cannot be enumerated by {A} itself. Though this is not quite the end of the story; after all, one also has {n+1 > n} for every natural number {n}, or equivalently that the union {A \cup \{a\}} of a finite set {A} and an additional element {a} cannot be enumerated by {A} itself, but the former statement extends to the infinite case, while the latter one does not. What causes these two outcomes to be distinct?
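The finitary form of Cantor's theorem can be made concrete with the usual diagonal construction: given any attempted enumeration of subsets of {A} by the elements of {A}, one builds a subset that the enumeration misses. The sketch below is only an illustration; the names `diagonal_set` and `all_subsets`, and the representation of the enumeration as a dictionary `f`, are assumptions made here.

```python
from itertools import product


def diagonal_set(A, f):
    """Given a map f from A to subsets of A, return a subset of A not in its range.

    The returned set differs from f[a] at the element a, for every a in A.
    """
    return frozenset(a for a in A if a not in f[a])


def all_subsets(A):
    """List all 2**len(A) subsets of A as frozensets."""
    A = list(A)
    return [frozenset(a for a, keep in zip(A, mask) if keep)
            for mask in product((False, True), repeat=len(A))]


if __name__ == "__main__":
    A = (0, 1, 2)
    f = {0: frozenset({0, 1}), 1: frozenset(), 2: frozenset({2})}
    missed = diagonal_set(A, f)
    print(sorted(missed), missed in f.values())
```

The same construction makes sense for infinite {A}, whereas the counting argument behind {n+1 > n} does not obviously carry over.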

On the other hand, it is less obvious what the finitary version of the Banach-Tarski paradox is. Note that this paradox is available only in three and higher dimensions, but not in one or two dimensions; so presumably a finitary analogue of this paradox should also make the same distinction between low and high dimensions.

I therefore set myself the exercise of trying to phrase Cantor’s theorem and the Banach-Tarski paradox in a more “finitary” language. It seems that the easiest way to accomplish this is to avoid the use of set theory, and replace sets by some other concept. Taking inspiration from theoretical computer science, I decided to replace concepts such as functions and sets by the concepts of algorithms and oracles instead, with various constructions in set theory being replaced instead by computer language pseudocode. The point of doing this is that one can now add a new parameter to the universe, namely the amount of computational resources one is willing to allow one’s algorithms to use. At one extreme, one can enforce a “strict finitist” viewpoint where the total computational resources available (time and memory) are bounded by some numerical constant, such as {10^{100}}; roughly speaking, this causes any mathematical construction to break down once its complexity exceeds this number. Or one can take the slightly more permissive “finitist” or “constructivist” viewpoint, where any finite amount of computational resource is permitted; or one can then move up to allowing any construction indexed by a countable ordinal, or the storage of any array of countable size. Finally one can allow constructions indexed by arbitrary ordinals (i.e. transfinite induction) and arrays of arbitrary infinite size, at which point the theory becomes more or less indistinguishable from standard set theory.

I describe this viewpoint, and how statements such as Cantor’s theorem and Banach-Tarski are interpreted with this viewpoint, below the fold. I should caution that this is a conceptual exercise rather than a rigorous one; I have not attempted to formalise these notions to the same extent that set theory is formalised. Thus, for instance, I have no explicit system of axioms that algorithms and oracles are supposed to obey. Of course, these formal issues have been explored in great depth by logicians over the past century or so, but I do not wish to focus on these topics in this post.

A second caveat is that the actual semantic content of this post is going to be extremely low. I am not going to provide any genuinely new proof of Cantor’s theorem, or give a new construction of Banach-Tarski type; instead, I will be reformulating the standard proofs and constructions in a different language. Nevertheless I believe this viewpoint is somewhat clarifying as to the nature of these paradoxes, and as to how they are not as fundamentally tied to the nature of sets or the nature of infinity as one might first expect.


In this final set of lecture notes for this course, we leave the realm of self-adjoint matrix ensembles, such as Wigner random matrices, and consider instead the simplest examples of non-self-adjoint ensembles, namely the iid matrix ensembles. (I had also hoped to discuss recent progress in eigenvalue spacing distributions of Wigner matrices, but have run out of time. For readers interested in this topic, I can recommend the recent Bourbaki exposé of Alice Guionnet.)
The basic result in this area is

Theorem 1 (Circular law) Let {M_n} be an {n \times n} iid matrix, whose entries {\xi_{ij}}, {1 \leq i,j \leq n} are iid with a fixed (complex) distribution {\xi_{ij} \equiv \xi} of mean zero and variance one. Then the spectral measure {\mu_{\frac{1}{\sqrt{n}} M_n}} converges both in probability and almost surely to the circular law {\mu_{circ} := \frac{1}{\pi} 1_{|x|^2+|y|^2 \leq 1}\ dx dy}, where {x, y} are the real and imaginary coordinates of the complex plane.

This theorem has a long history; it is analogous to the semi-circular law, but the non-Hermitian nature of the matrices makes the spectrum so unstable that key techniques that are used in the semi-circular case, such as truncation and the moment method, no longer work; significant new ideas are required. In the case of random gaussian matrices, this result was established by Mehta (in the complex case) and by Edelman (in the real case), as was sketched out in Notes. In 1984, Girko laid out a general strategy for establishing the result for non-gaussian matrices, which formed the base of all future work on the subject; however, a key ingredient in the argument, namely a bound on the least singular value of shifts {\frac{1}{\sqrt{n}} M_n - zI}, was not fully justified at the time. A rigorous proof of the circular law was then established by Bai, assuming additional moment and boundedness conditions on the individual entries. These additional conditions were then slowly removed in a sequence of papers by Götze-Tikhomirov, Girko, Pan-Zhou, and Tao-Vu, with the last moment condition being removed in a paper of myself, Van Vu, and Manjunath Krishnapur.
At present, the known methods used to establish the circular law for general ensembles rely very heavily on the joint independence of all the entries. It is a key challenge to see how to weaken this joint independence assumption.

In harmonic analysis and PDE, one often wants to place a function f: {\bf R}^d \to {\bf C} on some domain (let’s take a Euclidean space {\bf R}^d for simplicity) in one or more function spaces in order to quantify its “size” in some sense.  Examples include

  • The Lebesgue spaces L^p of functions f whose norm \|f\|_{L^p} := (\int_{{\bf R}^d} |f|^p)^{1/p} is finite, as well as their relatives such as the weak L^p spaces L^{p,\infty} (and more generally the Lorentz spaces L^{p,q}) and Orlicz spaces such as L \log L and e^L;
  • The classical regularity spaces C^k, together with their Hölder continuous counterparts C^{k,\alpha};
  • The Sobolev spaces W^{s,p} of functions f whose norm \|f\|_{W^{s,p}} = \|f\|_{L^p} + \| |\nabla|^s f\|_{L^p} is finite (other equivalent definitions of this norm exist, and there are technicalities if s is negative or p \not \in (1,\infty)), as well as relatives such as homogeneous Sobolev spaces \dot W^{s,p}, Besov spaces B^{s,p}_q, and Triebel-Lizorkin spaces F^{s,p}_q.  (The conventions for the superscripts and subscripts here are highly variable.)
  • Hardy spaces {\mathcal H}^p, the space BMO of functions of bounded mean oscillation (and the subspace VMO of functions of vanishing mean oscillation);
  • The Wiener algebra A;
  • Morrey spaces M^p_q;
  • The space M of finite measures;
  • etc., etc.

As the above partial list indicates, there is an entire zoo of function spaces one could consider, and it can be difficult at first to see how they are organised with respect to each other.  However, one can get some clarity in this regard by drawing a type diagram for the function spaces one is trying to study.  A type diagram assigns a tuple (usually a pair) of relevant exponents to each function space.  For function spaces X on Euclidean space, two such exponents are the regularity s of the space, and the integrability p of the space.  These two quantities are somewhat fuzzy in nature (and are not easily defined for all possible function spaces), but can basically be described as follows.  We test the function space norm \|f\|_X of a modulated rescaled bump function

f(x) := A e^{i x \cdot \xi} \phi( \frac{x-x_0}{R} ) (1)

where A > 0 is an amplitude, R > 0 is a radius, \phi \in C^\infty_c({\bf R}^d) is a test function, x_0 is a position, and \xi \in {\bf R}^d is a frequency of some magnitude |\xi| \sim N.  One then studies how the norm \|f\|_X depends on the parameters A, R, N.  Typically, one has a relationship of the form

\|f\|_X \sim A N^s R^{d/p} (2)

for some exponents s, p, at least in the high-frequency case when N is large (in particular, from the uncertainty principle it is natural to require N \gtrsim 1/R, and when dealing with inhomogeneous norms it is also natural to require N \gtrsim 1).  The exponent s measures how sensitive the X norm is to oscillation, and thus controls regularity; if s is large, then oscillating functions will have large X norm, and thus functions in X will tend not to oscillate too much and thus be smooth.    Similarly, the exponent p measures how sensitive the X norm is to the function f spreading out to large scales; if p is small, then slowly decaying functions will have large norm, so that functions in X tend to decay quickly; conversely, if p is large, then singular functions will tend to have large norm, so that functions in X will tend to not have high peaks.
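As a worked instance of the relationship (2), consider the Lebesgue norm of the bump function (1) for {0 < p < \infty}. The modulation {e^{i x \cdot \xi}} has absolute value {1}, and the change of variables {x = x_0 + R y} (so {dx = R^d\, dy}) gives

```latex
\|f\|_{L^p}
  = \Big( \int_{{\bf R}^d} A^p \Big| \phi\Big( \frac{x-x_0}{R} \Big) \Big|^p \, dx \Big)^{1/p}
  = A R^{d/p} \Big( \int_{{\bf R}^d} |\phi(y)|^p \, dy \Big)^{1/p}
  = A R^{d/p} \|\phi\|_{L^p}.
```

Since {\|\phi\|_{L^p}} is a constant depending only on the fixed test function, this matches (2) with regularity {s=0} and integrability exponent {p}, with no dependence on the frequency magnitude {N} or the position {x_0}.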

Note that the exponent s in (2) could be positive, zero, or negative; however, the exponent p should be non-negative, since intuitively enlarging R should always lead to a larger (or at least comparable) norm.  Finally, the exponent in the A parameter should always be 1, since norms are by definition homogeneous.  Note also that the position x_0 plays no role in (2); this reflects the fact that most of the popular function spaces in analysis are translation-invariant.

The type diagram below plots the s, 1/p indices of various spaces.  The black dots indicate those spaces for which the s, 1/p indices are fixed; the blue dots are those spaces for which at least one of the s, 1/p indices is variable (and so, depending on the value chosen for these parameters, these spaces may end up in a different location on the type diagram than the typical location indicated here).

(There are some minor cheats in this diagram, for instance for the Orlicz spaces L \log L and e^L one has to adjust (2) by a logarithmic factor.   Also, the norms for the Schwartz space {\mathcal S} are not translation-invariant and thus not perfectly describable by this formalism. This picture should be viewed as a visual aid only, and not as a genuinely rigorous mathematical statement.)

The type diagram can be used to clarify some of the relationships between function spaces, such as Sobolev embedding.  For instance, when working with inhomogeneous spaces (which basically identify low frequencies N \ll 1 with medium frequencies N \sim 1, so that one is effectively always in the regime N \gtrsim 1), then decreasing the s parameter results in decreasing the right-hand side of (2).  Thus, one expects the function space norms to get smaller (and the function spaces to get larger) if one decreases s while keeping p fixed.  Thus, for instance, W^{k,p} should be contained in W^{k-1,p}, and so forth.  Note however that this inclusion is not available for homogeneous function spaces such as \dot W^{k,p}, in which the frequency parameter N can be either much larger than 1 or much smaller than 1.

Similarly, if one is working in a compact domain rather than in {\bf R}^d, then one has effectively capped the radius parameter R to be bounded, and so we expect the function space norms to get smaller (and the function spaces to get larger) as one increases 1/p, thus for instance L^2 will be contained in L^1.  Conversely, if one is working in a discrete domain such as {\bf Z}^d, then the radius parameter R has now effectively been bounded from below, and the reverse should occur: the function spaces should get larger as one decreases 1/p.  (If the domain is both compact and discrete, then it is finite, and on a finite-dimensional space all norms are equivalent.)

As mentioned earlier, the uncertainty principle suggests that one has the restriction N \gtrsim 1/R.  From this and (2), we expect to be able to enlarge the function space by trading in the regularity parameter s for the integrability parameter p, keeping the dimensional quantity d/p - s fixed.  This is indeed how Sobolev embedding works.   Note that in some cases one runs out of regularity before p goes all the way to infinity (thus ending up at an L^p space), while in other cases p hits infinity first.  In the latter case, one can embed the Sobolev space into a Hölder space such as C^{k,\alpha}.

On continuous domains, one can send the frequency N off to infinity, keeping the amplitude A and radius R fixed.  From this and (2) we see that norms with a lower regularity s can never hope to control norms with a higher regularity s' > s, no matter what one does with the integrability parameter.   Note however that in discrete settings this obstruction disappears; when working on, say, {\bf Z}^d, then in fact one can gain as much regularity as one wishes for free, and there is no distinction between a Lebesgue space \ell^p and its Sobolev counterparts W^{k,p} in such a setting.

When interpolating between two spaces (using either the real or complex interpolation method), the interpolated space usually has regularity and integrability exponents on the line segment between the corresponding exponents of the endpoint spaces.  (This can be heuristically justified from the formula (2) by thinking about how the real or complex interpolation methods actually work.)  Typically, one can control the norm of the interpolated space by the geometric mean of the endpoint norms that is indicated by this line segment; again, this is plausible from looking at (2).

The space L^2 is self-dual.  More generally, the dual of a function space X will generally have type exponents that are the reflection of the original exponents around the L^2 origin.  Consider for instance the dual spaces H^s, H^{-s} or {\mathcal H}^1, BMO in the above diagram.

Spaces whose integrability exponent p is larger than 1 (i.e. which lie to the left of the dotted line) tend to be Banach spaces, while spaces whose integrability exponent is less than 1 are almost never Banach spaces.  (This can be justified by covering a large ball by small balls and considering how (2) would interact with the triangle inequality in this case).  The case p=1 is borderline; some spaces at this level of integrability, such as L^1, are Banach spaces, while other spaces, such as L^{1,\infty}, are not.

While the regularity s and integrability p are usually the most important exponents in a function space (because amplitude, width, and frequency are usually the most important features of a function in analysis), they do not tell the entire story.  One major reason for this is that the modulated bump functions (1), while an important class of test examples of functions, are by no means the only functions that one would wish to study.  For instance, one could also consider sums of bump functions (1) at different scales.  The behaviour of the function space norms on such spaces is often controlled by secondary exponents, such as the second exponent q that arises in Lorentz spaces, Besov spaces, or Triebel-Lizorkin spaces.  For instance, consider the function

f_M(x) := \sum_{m=1}^M 2^{-md} \phi(x/2^m), (3)

where M is a large integer, representing the number of distinct scales present in f_M.  Any function space with regularity s=0 and p=1 should assign each summand 2^{-md} \phi(x/2^m) in (3) a norm of O(1), so the norm of f_M could be as large as O(M) if one assumes the triangle inequality.  This is indeed the case for the L^1 norm, but for the weak L^1 norm, i.e. the L^{1,\infty} norm,  f_M only has size O(1).  More generally, for the Lorentz spaces L^{1,q}, f_M will have a norm of about O(M^{1/q}).   Thus we see that such secondary exponents can influence the norm of a function by an amount which is polynomial in the number of scales.  In many applications, though, the number of scales is a “logarithmic” quantity and thus of lower order interest when compared against the “polynomial” exponents such as s and p.  So the fine distinctions between, say, strong L^1 and weak L^1, are only of interest in “critical” situations in which one cannot afford to lose any logarithmic factors (this is for instance the case in much of Calderón-Zygmund theory).
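The contrast between the L^1 and weak L^1 norms of f_M can be computed exactly in a toy case. The sketch below assumes d = 1 and replaces the smooth test function \phi by the indicator function of [-1,1] (a simplification made here so that f_M becomes a step function); the function names are hypothetical. The weak norm is taken to be \sup_{\lambda > 0} \lambda |\{ |f_M| > \lambda \}|.

```python
def level_sets(M):
    """Describe f_M as a step function for d = 1, phi = indicator of [-1, 1].

    Returns (height, measure) pairs from the innermost region outward:
    region k = 1 is |x| <= 2, region k >= 2 is 2**(k-1) < |x| <= 2**k.
    """
    shells = []
    for k in range(1, M + 1):
        # Only the summands with 2**m >= |x|, i.e. m >= k, are non-zero here.
        height = sum(2.0 ** -m for m in range(k, M + 1))
        measure = 4.0 if k == 1 else 2.0 ** k
        shells.append((height, measure))
    return shells


def l1_norm(M):
    """Exact L^1 norm of f_M: sum of height * measure over the regions."""
    return sum(h * mu for h, mu in level_sets(M))


def weak_l1_norm(M):
    """Exact weak L^1 quasi-norm: sup over lambda of lambda * |{f_M > lambda}|.

    Heights decrease outward, so the sup is approached as lambda increases
    to some height h_k, where the superlevel set is regions 1..k.
    """
    best, cumulative = 0.0, 0.0
    for h, mu in level_sets(M):
        cumulative += mu
        best = max(best, h * cumulative)
    return best


if __name__ == "__main__":
    for M in (1, 5, 10, 20):
        print(M, l1_norm(M), weak_l1_norm(M))
```

In this toy case the L^1 norm grows linearly in M while the weak L^1 quasi-norm stays bounded, in line with the O(M) versus O(1) behaviour described above.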

We have cheated somewhat by only working in the high frequency regime.  When dealing with inhomogeneous spaces, one often has a different set of exponents for (2) in the low-frequency regime than in the high-frequency regime.  In such cases, one sometimes has to use a more complicated type diagram to  genuinely model the situation, e.g. by assigning to each space a convex set of type exponents rather than a single exponent, or perhaps having two separate type diagrams, one for the high frequency regime and one for the low frequency regime.   Such diagrams can get quite complicated, and will probably not be of much use to a beginner in the subject, though in the hands of an expert who knows what he or she is doing, they can still be an effective visual aid.

Starting on Monday, March 29, I will begin my graduate class for the spring quarter, entitled “Higher order Fourier analysis“.  While classical Fourier analysis is concerned with correlations with linear phases such as x \mapsto e(\alpha x) (where e(x) := e^{2\pi i x}), quadratic and higher order Fourier analysis is concerned with quadratic and higher order phases such as x \mapsto e(\alpha x^2), x \mapsto e(\alpha x^3), etc.

In recent years, it has become clear that certain problems in additive combinatorics are naturally associated with a certain order of Fourier analysis.  For instance, problems involving arithmetic progressions of length three are connected with classical Fourier analysis; problems involving progressions of length four are connected with quadratic Fourier analysis; problems involving progressions of length five are connected with cubic Fourier analysis; and so forth.  The reasons for this will be discussed later in the course, but we will just give one indication of the connection here: linear phases x \mapsto e(\alpha x) and arithmetic progressions n, n+r, n+2r of length three are connected by the identity

e(\alpha n) e(\alpha(n+r))^{-2} e(\alpha(n+2r)) = 1,

while quadratic phases x \mapsto e(\alpha x^2) and arithmetic progressions n, n+r, n+2r, n+3r of length four are connected by the identity

e(\alpha n^2) e(\alpha(n+r)^2)^{-3} e(\alpha(n+2r)^2)^3 e(\alpha(n+3r)^2)^{-1} = 1,

and so forth.
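
Both identities come down to the vanishing of an integer combination of the exponents, for instance n^2 - 3(n+r)^2 + 3(n+2r)^2 - (n+3r)^2 = 0. As a quick sanity check (a sketch only; the function names and the choice \alpha = \sqrt{2} are illustrative assumptions), one can evaluate both products numerically:

```python
import cmath
import math

def e(x):
    """Fundamental character e(x) = exp(2*pi*i*x)."""
    return cmath.exp(2j * math.pi * x)

def linear_product(alpha, n, r):
    """Left-hand side of the length-three identity."""
    return e(alpha * n) * e(alpha * (n + r)) ** -2 * e(alpha * (n + 2 * r))

def quadratic_product(alpha, n, r):
    """Left-hand side of the length-four identity."""
    return (e(alpha * n**2)
            * e(alpha * (n + r) ** 2) ** -3
            * e(alpha * (n + 2 * r) ** 2) ** 3
            * e(alpha * (n + 3 * r) ** 2) ** -1)

def quadratic_exponent(n, r):
    """Integer combination of exponents behind the quadratic identity."""
    return n**2 - 3 * (n + r) ** 2 + 3 * (n + 2 * r) ** 2 - (n + 3 * r) ** 2

def max_deviation(product_fn, alpha, size=10):
    """Largest distance from 1 over 1 <= n, r <= size (floating point)."""
    return max(abs(product_fn(alpha, n, r) - 1)
               for n in range(1, size + 1)
               for r in range(1, size + 1))

alpha = math.sqrt(2)  # an arbitrary irrational frequency
print(max_deviation(linear_product, alpha))
print(max_deviation(quadratic_product, alpha))
```

The deviations printed are at the level of floating point rounding error, while the exponent combination vanishes exactly.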

It turns out that in order to get a complete theory of higher order Fourier analysis, the simple polynomial phases of the type given above do not suffice.  One must also consider more exotic objects such as locally polynomial phases, bracket polynomial phases (such as n \mapsto e( \lfloor \alpha n \rfloor \beta n )), and/or nilsequences (sequences arising from an orbit in a nilmanifold G/\Gamma).  These (closely related) families of objects will be introduced later in the course.

Classical Fourier analysis revolves around the Fourier transform and the inversion formula.  Unfortunately, we have not yet been able to locate similar identities in the higher order setting, but one can establish weaker results, such as higher order structure theorems and arithmetic regularity lemmas, which are sufficient for many purposes, such as proving Szemerédi’s theorem on arithmetic progressions, or my theorem with Ben Green that the primes contain arbitrarily long arithmetic progressions.  These results are powered by the inverse conjecture for the Gowers norms, which is now extremely close to being fully resolved.

Our focus here will primarily be on the finitary approach to the subject, but there is also an important infinitary aspect to the theory, originally coming from ergodic theory but more recently from nonstandard analysis (or more precisely, ultralimit analysis) as well; we will touch upon these perspectives in the course, though they will not be the primary focus.  If time permits, we will also present the number-theoretic applications of this machinery to counting arithmetic progressions and other linear patterns in the primes.

Now we turn attention to another important spectral statistic, the least singular value {\sigma_n(M)} of an {n \times n} matrix {M} (or, more generally, the least non-trivial singular value {\sigma_p(M)} of an {n \times p} matrix with {p \leq n}). This quantity controls the invertibility of {M}. Indeed, {M} is invertible precisely when {\sigma_n(M)} is non-zero, and the operator norm {\|M^{-1}\|_{op}} of {M^{-1}} is given by {1/\sigma_n(M)}. This quantity is also related to the condition number {\sigma_1(M)/\sigma_n(M) = \|M\|_{op} \|M^{-1}\|_{op}} of {M}, which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of {M} (and more generally, of the shifts {\frac{1}{\sqrt{n}} M - zI} for complex {z}) will be of importance in rigorously establishing the circular law for iid random matrices {M}, as it plays a key role in computing the Stieltjes transform {\frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M - zI)^{-1}} of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.

The least singular value

\displaystyle  \sigma_n(M) = \inf_{\|x\|=1} \|Mx\|,

which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm

\displaystyle  \|M\|_{op} = \sigma_1(M) = \sup_{\|x\|=1} \|Mx\|

at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of {x}, but is not able to control the “generic” or “incompressible” choices of {x}, for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with {p=yn} for {0 < y < 1}, it gives an upper bound of {(1-\sqrt{y}+o(1))\sqrt{n}} for the least singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as {\hbox{tr} M^{-2}}, though these are more difficult to compute than positive moments {\hbox{tr} M^k}.

So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the {n} rows {X_1,\ldots,X_n \in {\bf C}^n} of the matrix {M}, and the hyperplane spanned by the other {n-1} rows. The reason for this is as follows. First suppose that {\sigma_n(M)=0}, so that {M} is non-invertible, and there is a linear dependence between the rows {X_1,\ldots,X_n}. Thus, one of the {X_i} will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the {n} distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so {\sigma_n(M)=0}.

More generally, if the least singular value {\sigma_n(M)} is small, one generically expects many of these {n} distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row {X_i} and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of {X_i} with a unit normal {n_i} of this hyperplane.

When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal {n_i} (which depends on all the rows other than {X_i}) is independent of {X_i}, so even after conditioning {n_i} to be fixed, the entries of {X_i} remain independent. As such, the dot product {X_i \cdot n_i} is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal {n_i} is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)
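
To illustrate why structured normals are the problematic case, here is a small exact computation. It is only a sketch: the two normals below are hypothetical examples chosen for illustration, and are left unnormalised since rescaling does not affect the event {X_i \cdot n_i = 0}.

```python
from fractions import Fraction
from itertools import product

def prob_dot_is_zero(normal):
    """Exact P(X . normal == 0) for X uniform on {-1,+1}^n.

    `normal` is a list of integers; it is not normalised to unit length,
    because scaling does not change whether the dot product vanishes.
    """
    n = len(normal)
    hits = sum(1 for signs in product((-1, 1), repeat=n)
               if sum(s * a for s, a in zip(signs, normal)) == 0)
    return Fraction(hits, 2**n)

for n in (4, 8, 12):
    sparse_normal = [1, -1] + [0] * (n - 2)   # supported on two coordinates
    flat_normal = [1] * n                     # mass spread over all coordinates
    print(n, prob_dot_is_zero(sparse_normal), prob_dot_is_zero(flat_normal))
```

For the sparse normal the dot product vanishes with probability exactly 1/2 for every n, so no amount of independence in the remaining entries helps. For the flat normal the probability is {\binom{n}{n/2}/2^n}, which decays like {n^{-1/2}}.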

These methods rely quite strongly on the joint independence of all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble {M = (\xi_{ij})_{1 \leq i,j \leq n}} of random sign matrices, where {\xi_{ij} = \pm 1} are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.


Van Vu and I have just uploaded to the arXiv our joint paper “The Littlewood-Offord problem in high dimensions and a conjecture of Frankl and Füredi“. In this short paper we give a different proof of a high-dimensional Littlewood-Offord result of Frankl and Füredi, and in the process also affirmatively answer one of their open problems.

Let {v_1,\ldots,v_n} be {n} vectors in {{\mathbb R}^d}, which we normalise to all have length at least {1}. For any given radius {\Delta > 0}, we consider the small ball probability

\displaystyle  p(v_1,\ldots,v_n,\Delta) := \sup_B {\bf P}( \eta_1 v_1 + \ldots + \eta_n v_n \in B )

where {\eta_1,\ldots,\eta_n} are iid Bernoulli signs (i.e. they take values {+1} or {-1} independently with a probability of {1/2} of each), and {B} ranges over all (closed) balls of radius {\Delta}. The Littlewood-Offord problem is to compute the quantity

\displaystyle  p_d(n,\Delta) := \sup_{v_1,\ldots,v_n} p(v_1,\ldots,v_n,\Delta)

where {v_1,\ldots,v_n} range over all vectors in {{\mathbb R}^d} of length at least one. Informally, this number measures the extent to which a random walk of length {n} (with all steps of size at least one) can concentrate into a ball of radius {\Delta}.

The one-dimensional case of this problem was answered by Erdős. First, one observes that one can normalise all the {v_i} to be at least {+1} (as opposed to being at most {-1}). In the model case when {\Delta < 1}, he made the following simple observation: if a random sum {\eta_1 v_1 + \ldots + \eta_n v_n} fell into a ball of radius {\Delta} (which in the one-dimensional case, is an interval of length less than {2}), and one then changed one or more of the signs {\eta_i} from {-1} to {+1}, then the new sum must necessarily lie outside of the ball. In other words, for any ball {B} of radius {\Delta}, the set of signs {(\eta_1,\ldots,\eta_n) \in \{-1,+1\}^n} for which {\eta_1 v_1 + \ldots + \eta_n v_n \in B} forms an antichain. Applying Sperner’s theorem, the maximal size of this antichain is {\binom{n}{\lfloor n/2\rfloor}}, and this soon leads to the exact value

\displaystyle  p_1(n,\Delta) = \binom{n}{\lfloor n/2\rfloor}/2^n = \frac{\sqrt{\frac{2}{\pi}}+o(1)}{\sqrt{n}}

when {0 \leq \Delta < 1} (the bound is attained in the extreme case {v_1=\ldots=v_n=1}).
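
For small {n} one can confirm this by brute force. The sketch below (the function name is a hypothetical choice) enumerates all {2^n} sign patterns and slides a closed interval of radius {\Delta} along the sorted sums; it only computes {p(v_1,\ldots,v_n,\Delta)} for a given choice of the {v_i}, not the supremum over all choices.

```python
from fractions import Fraction
from itertools import product
from math import comb

def small_ball_probability_1d(vs, delta):
    """p(v_1,...,v_n,delta) in one dimension, by exhaustive enumeration.

    A closed interval of radius delta can always be shifted so that its left
    endpoint is one of the attained sums, so it suffices to try those.
    Pass integers or Fractions to keep the arithmetic exact.
    """
    sums = sorted(sum(eta * v for eta, v in zip(signs, vs))
                  for signs in product((-1, 1), repeat=len(vs)))
    best = 0
    j = 0
    for i, left in enumerate(sums):
        # advance j past every sum lying in [left, left + 2*delta]
        while j < len(sums) and sums[j] <= left + 2 * delta:
            j += 1
        best = max(best, j - i)
    return Fraction(best, len(sums))

n = 8
extremal = small_ball_probability_1d([1] * n, Fraction(1, 2))
spread = small_ball_probability_1d(list(range(1, n + 1)), Fraction(1, 2))
print(extremal, Fraction(comb(n, n // 2), 2**n), spread)
```

With {v_1=\ldots=v_n=1} the enumeration reproduces {\binom{n}{\lfloor n/2\rfloor}/2^n}, while other choices such as {v_i = i} concentrate no more than this.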

A similar argument works for higher values of {\Delta}, using Dilworth’s theorem instead of Sperner’s theorem, and gives the exact value

\displaystyle  p_1(n,\Delta) = \sum_{j=1}^s \binom{n}{m_j}/2^n = \frac{s\sqrt{\frac{2}{\pi}}+o(1)}{\sqrt{n}}

whenever {n \geq s} and {s-1 \leq \Delta < s} for some natural number {s}, where {\binom{n}{m_1},\ldots,\binom{n}{m_s}} are the {s} largest binomial coefficients of {\binom{n}{0}, \ldots, \binom{n}{n}}.
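
The formula is easy to evaluate exactly. In the sketch below (function names are hypothetical), `p1` sums the {s} largest binomial coefficients, and `flat_walk_concentration` computes the concentration of the extremal walk {v_1=\ldots=v_n=1} directly: its values are spaced by {2}, so a closed interval of radius {\Delta} with {s-1 \leq \Delta < s} captures at most {s} consecutive values.

```python
from fractions import Fraction
from math import comb, floor, pi, sqrt

def p1(n, delta):
    """Sum of the s largest binomial coefficients over 2^n, s = floor(delta) + 1."""
    s = floor(delta) + 1
    if n < s:
        raise ValueError("formula is stated for n >= s")
    row = sorted(comb(n, k) for k in range(n + 1))
    return Fraction(sum(row[-s:]), 2**n)

def flat_walk_concentration(n, delta):
    """Best probability that eta_1 + ... + eta_n lands in a closed interval of
    radius delta: the heaviest window of s consecutive attainable values."""
    s = floor(delta) + 1
    row = [comb(n, k) for k in range(n + 1)]
    best = max(sum(row[i:i + s]) for i in range(n + 2 - s))
    return Fraction(best, 2**n)

print(p1(6, 1), flat_walk_concentration(6, 1))
# compare with the asymptotic s * sqrt(2/pi) / sqrt(n) for s = 2
n = 400
print(float(p1(n, 1)) * sqrt(n), 2 * sqrt(2 / pi))
```

The two functions agree because the {s} largest binomial coefficients are consecutive, which is why the all-ones walk attains the bound.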

Now consider the higher-dimensional problem. One has the obvious bound

\displaystyle  p_d(n,\Delta) \geq p_1(n,\Delta),

but it is not obvious whether this inequality is strict. In other words, is there some way to exploit the additional freedom given by higher dimensions to make random walks concentrate more than in the one-dimensional case?

For some values of {\Delta}, it turns out that the answer is yes, as was first observed by Kleitman (and discussed further by Frankl and Füredi). Suppose for instance that

\displaystyle  \sqrt{(s-1)^2+1} \leq \Delta < s

for some {s \geq 2}. Then one can consider the example in which {v_1=\ldots=v_{n-1}=e_1} is one unit vector, and {v_n=e_2} is another unit vector orthogonal to {e_1}. The small ball probability in this case can be computed to equal {p_1(n-1,s-1)}, which is slightly larger than {p_1(n,s-1)}.
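
This example can be checked by enumeration for small parameters. The sketch below is illustrative only: it takes {\Delta = \sqrt{(s-1)^2+1}} (the left endpoint of the range), restricts to balls centred at integer points {(c,0)} on the {e_1} axis, and therefore computes a lower bound for the small ball probability that happens to match {p_1(n-1,s-1)} for the parameters tried. The function names and the choice {n=8}, {s=2} are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def p1_largest(n, s):
    """Sum of the s largest binomial coefficients of order n, over 2^n."""
    row = sorted(comb(n, k) for k in range(n + 1))
    return Fraction(sum(row[-s:]), 2**n)

def kleitman_example(n, s):
    """Concentration of eta_1 e_1 + ... + eta_{n-1} e_1 + eta_n e_2 in a closed
    ball of squared radius (s-1)^2 + 1 centred at (c, 0), maximised over
    integer c. Squared distances keep the comparison exact."""
    radius_sq = (s - 1) ** 2 + 1
    points = [(sum(signs[:-1]), signs[-1])
              for signs in product((-1, 1), repeat=n)]
    best = 0
    for c in range(-n, n + 1):
        count = sum(1 for x, y in points if (x - c) ** 2 + y ** 2 <= radius_sq)
        best = max(best, count)
    return Fraction(best, 2**n)

n, s = 8, 2
print(kleitman_example(n, s), p1_largest(n - 1, s), p1_largest(n, s))
```

For {n=8}, {s=2} this gives {35/64} for the two-dimensional example against {63/128} for the one-dimensional bound. The size of the gain depends on the parameters; for instance with {n=7}, {s=2} both quantities equal {35/64}.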

In the positive direction, Frankl and Füredi established the asymptotic

\displaystyle  p_d(n,\Delta) = (1 + o(1)) p_1(n,\Delta) \ \ \ \ \ (1)

as {n \rightarrow \infty} (holding {d} and {\Delta} fixed). Furthermore, if {\Delta} was close to an integer, and more precisely if

\displaystyle  s-1 \leq \Delta < s-1 + \frac{1}{10s^2}

(so that the above counterexample can be avoided) they showed that {p_d(n,\Delta) = p_1(n,\Delta)} for sufficiently large {n} (depending on {s,\Delta}).

The factor {\frac{1}{10s^2}} was an artefact of their method, and they conjectured in fact that one should have {p_d(n,\Delta) = p_1(n,\Delta)} for sufficiently large {n} whenever

\displaystyle  s-1 \leq \Delta < \sqrt{(s-1)^2+1} \ \ \ \ \ (2)

thus matching the counterexample exactly. This conjecture was verified for {s = 1} by Kleitman and for {s=2,3} by Frankl and Füredi.

In this paper we verify the conjecture of Frankl and Füredi (and give a new proof of their asymptotic (1)). Our main tool is the following high-dimensional Littlewood-Offord inequality:

Theorem 1 Suppose that {v_1,\ldots,v_n \in {\mathbb R}^d} are genuinely {d}-dimensional in the sense that for any hyperplane {H} going through the origin, one has {\hbox{dist}(v_i,H) \geq 1} for at least {k} values of {i}. Then one has

\displaystyle  p(v_1,\ldots,v_n,\Delta) \ll_{d,\Delta} k^{-d/2}.

Theorem 1 can be viewed as a high-dimensional variant of Erdős’s inequality (but without the sharp upper bound). It is proven by the Fourier-analytic method of Halász. (This theorem was announced in my book with Van Vu several years ago, but we did not get around to publishing it until now.)

Using Theorem 1, one can verify the conjecture of Frankl and Füredi fairly quickly (the deduction takes a little over a page). The main point is that if there is excessive concentration, then Theorem 1 quickly forces almost all of the vectors {v_1,\ldots,v_n} to lie very close to a line. If all the vectors are close to a line, then we can project onto this line and rescale, which causes {\Delta} to worsen a little bit in this reduction to the one-dimensional case, but it turns out that the bounds (2) allow us to tolerate this degradation of {\Delta} once {s>3} (so it is fortunate that the cases {s \leq 3} were already done for us!). If instead we have a vector far from the line (as is the case in the key counterexample), then we manually eliminate that vector using the parallelogram law, which effectively drops {\Delta} below {s-1} (half of the time, at least) if {\Delta} was initially less than {\sqrt{(s-1)^2+1}}, which gives enough of a saving to conclude the argument.

One moral that one can draw from this argument is that one can use a quasi-sharp estimate (such as Theorem 1), which ostensibly loses constant factors, to then deduce a sharp estimate (such as the Frankl-Füredi conjecture) that loses no constant factors, as long as one is in an asymptotic regime (in this case, {s \geq 3} and {n} large depending on {d,\Delta}). The key is to exploit the fine structure in the main term (in this case, the piecewise constant nature of {p_1(n,\Delta)} when {\Delta} passes over integers) to extract gains that can absorb the losses coming from the quasi-sharp estimate.