The first Distinguished Lecture Series at UCLA of this academic year is being given this week by my good friend and fellow Medalist Charlie Fefferman, who also happens to be my “older brother” (we were both students of Elias Stein). The theme of Charlie’s lectures is “Interpolation of functions on ${\Bbb R}^n$“, in the spirit of the classical Whitney extension theorem, except that now one is considering much more quantitative and computational extension problems (in particular, viewing the problem from a theoretical computer science perspective). Today Charlie introduced the basic problems in this subject, and stated some of the results of his joint work with Bo’az Klartag; he will continue the lectures on Thursday and Friday.

The general topic of extracting quantitative bounds from classical qualitative theorems is a subject that I am personally very fond of, and Charlie gave a wonderfully accessible presentation of the main results, though the actual details of the proofs were left to the next two lectures.

As usual, all errors and omissions here are my responsibility, and are not due to Charlie.

The original extension question addressed by Whitney (with further work by many others including Glaeser, Brudnyi-Shvartsman, Zobin, Bierstone-Milman-Pawlucki, Fefferman, and Brudnyi-Brudnyi) is the following. Suppose one is given a subset E of a given Euclidean space ${\Bbb R}^n$, and suppose one also given a function $f: E \to {\Bbb R}$, as well as a regularity m (which is a non-negative integer). Is it then possible to extend f to a globally defined function $F: {\Bbb R}^n \to {\Bbb R}$ (thus $F(x)=f(x)$ for all $x \in E$), which lies in the regularity class $C^m({\Bbb R}^n)$ of m-times continuously differentiable functions?

1. The larger E is, the harder it is to obtain an extension. For instance, if E is finite, then an easy application of the Lagrange interpolation formula shows that any function $f: E \to {\Bbb R}$ can be extended to a smooth function (or even a polynomial) $F: {\Bbb R}^n \to {\Bbb R}$. At the other extreme, if $E = {\Bbb R}^n$, then of course f can be extended to a $C^m$ function F if and only if f already has bounded continuous derivatives up to $m^{th}$ order.
2. On the other hand, it is easy to reduce to the case when E is bounded; if E is unbounded and one can solve the extension problem for the bounded sets $E \cap B(0,R)$ for every radius R (and for the restriction of f to these sets), then it is not difficult (by use of partitions of unity) to show that the extension problem can be solved for the unbounded set E.
3. One can reduce further to the case when E is compact, because if the values of a continuous function $F: {\Bbb R}^n \to {\Bbb R}$ is specified on a set E, this uniquely specifies F on the closure $\overline{E}$.

The case m=0 of this problem is answered by the Tietze extension theorem. For higher m, the problem is partially answered by the Whitney extension theorem (of which more will be said in subsequent talks).

Charlie’s talk is concerned with more “finitary” or “quantitative” versions of the above problem. Now E is a finite subset of ${\Bbb R}^n$, with cardinality |E|=N (say), so that $f: E \to {\Bbb R}$ is now simply a list of N real numbers. (One should think of n and m as being fixed and relatively small, but N as being extremely large.) As remarked above, it is now very easy to extend f to an m-times continuously differentiable function $F: {\Bbb R}^n \to {\Bbb R}$, and one can even make F boundedly continuously differentiable, thus the norm

$\|F\|_{C^m({\Bbb R}^n)} := \sup_{0 \leq j \leq m} \sup_{x \in {\Bbb R^n}} |\nabla^j F(x)|$

is finite. However, one can now ask some quantitative questions, which we will first state in a rather informal fashion:

1. Given the data n, m, E, f, what is the best possible value of $\|F\|_{C^m({\Bbb R}^n)}$? Can this value (or some approximation to this value) be “computed” efficiently?
2. Given the data n, m, E, f, can one “compute” an extension F of f whose norm $\|F\|_{C^m({\Bbb R}^n)}$ is close to optimal?
3. Can the extension operation $f \mapsto F$ be chosen to be linear?
4. How does one answer Q1, Q2, Q3 if we allow the use of approximate extensions $F: {\Bbb R}^n \to {\Bbb R}$, in which F(x) is only required to be close to f(x) for $x \in E$, rather than equal to f(x), with some error tolerance $\sigma(x) \geq 0$ which may depend on x?
5. How does one answer Q1, Q2, Q3 if we also are allowed to discard a few elements of E (so that F(x) is now only required to be equal or close to f(x) outside of a small exceptional set)?

Such questions are of relevance in computer science problems such as machine learning, in which an unknown smooth function $F: {\Bbb R}^n \to {\Bbb R}$ is “learned” by sampling its values at a finite number of points $x \in E$, but where the observed values f(x) at x can differ from the true value F(x) by some experimental error $\sigma(x)$ (and in some small number of cases, f(x) may be completely incorrect). The problem is then how to interpolate F from this noisy (and occasionally completely wrong) data. Such problems are of course also related to the more classical problem of regression in statistics, but with the difference that the function F is only assumed to be smooth (in some quantitative sense), rather than being linear, polynomial, or otherwise being determined by a small number of parameters.

The above questions are of course rather imprecise. To make them more precise, we need to specify what “compute” means. For simplicity, let us take a computational model in which one has a certain amount of memory, say M registers $x_1,\ldots,x_M$, with each register containing an infinite-precision real number. We then consider performing a number of operations on these registers, where each operation involves only a small number (e.g. two or three) of the registers and performs a simple operation (e.g. adding or multiplying the values of two registers and placing the result in a third). [The infinite precision model is not realistic, of course, but Charlie remarked that provided that one can tolerate a sufficient amount of boredom, one can state and prove analogues of the results stated below for finite precision computational models.]

A computer with M registers can certainly store all the data n, m, E, f (and sometimes $\sigma$) needed for the above problems, if M is larger than some constant multiple of N. But what can it mean to “compute” a function $F: {\Bbb R}^n \to {\Bbb R}$? The domain of this function is infinite and so it cannot be “stored” in this finite-memory computer simply by storing each of its values separately. So instead, we shall say that F can be computed on a computer with M registers with an amount W of one-time work and an amount Q of query work if, after W computations, the computer has reached a state in which, after being given any input x, the computer can output F(x) (to infinite accuracy) after at most Q computations.

To state the theorems of Fefferman and Klartag we need a little more notation. Given data n, m, E, f, as well as an error tolerance function $\sigma: E \to {\Bbb R}^+$, let us define the quantity $\|f\|_{(E,\sigma)}$ to be the best constant M such that one can find a function $F \in C^m({\Bbb R}^n)$ with $\|F\|_{C^m({\Bbb R}^n)} \leq M$, which approximates f in the sense that $|F(x)-f(x)| \leq M \sigma(x)$ for all $x \in E$. This quantity $\|f\|_{(E,\sigma)}$ is then a norm on f (for fixed E and $\sigma$), and measures how well one can fit a reasonably smooth function F to the data f with error tolerance $\sigma$. (If one wants to fit F to f perfectly, one can simply set $\sigma = 0$.) If a function F obeys the above properties with an M which is comparable in magnitude to the optimal value $\|f\|_{(E,\sigma)}$, thus $M = O( \|f\|_{(E,\sigma)} )$, then we say that F is an essentially optimal interpolant of f. (Here and in the sequel, all implied constants in the O() notation are allowed to depend on the dimension n and the regularity m, but are uniform in N, E, f, $\sigma$.)

Theorem 1. If n, m are fixed, and E, f, $\sigma$ are given with |E|=N, then one can compute the order of magnitude of $\|f\|_{(E,\sigma)}$ using only $O(N \log N)$ computations and $O(N)$ amount of memory. [More precisely, one can use these computational resources to produce a quantity Y which differs from $\|f\|_{(E,\sigma)}$ by at most a multiplicative constant.]

Theorem 2. With the same setup as Theorem 1, one can compute an essentially optimal function F using only $O(N \log N)$ one-time work, $O(N)$ memory, and $O( \log N )$ query work. Furthermore, for fixed n, m, E, $\sigma$, the map $f \mapsto F$ can be chosen to be linear!

Apart from the factors of $\log N$, these bounds are optimal, as one needs O(N) amount of work and memory merely to read all the data. (From the way we have set up the problem, one cannot hope to obtain any meaningful results by sampling only a small portion of the data.)

Now let us consider the problem of what happens if we are willing to discard k points (say) from E in order to improve the interpolant. To this end, let us define the quantity

$\|f\|_{(E,\sigma,k)} := \inf \{ \|f\|_{(E \backslash S, \sigma)}: S \subset E; |S| \leq k\}$

to denote the quality of the best interpolant one obtains from the data f after being allowed to reject k data points S. The key problem, of course, is to work out which sets S are good to reject; once S is identified, one can use Theorems 1 and 2 to compute the quantity $\|f\|_{(E,\sigma,k)}$, as well as a good interpolant F. Let us say that a set S is an essentially optimal rejection set if we have

$\|f\|_{(E \backslash S, \sigma)} = O( \|f\|_{(E,\sigma,c|S|)} )$

for some constant c > 0 (depending only on n and m); this means that throwing out S gives you about as good a fit as you can hope to accomplish with the remaining data, apart from the fact that you may have thrown out a bit more data (by a multiplicative factor) than you really needed to.

Theorem 3. With the same setup as Theorem 1, one can compute an ordering $\{x_1,\ldots,x_N\}$ of E in $O(N^2 \log N)$ computations and $O( N )$ memory, such that the set $\{x_1,\ldots,x_k\}$ is an essentially optimal rejection set for every $0 \leq k \leq N$.

Basically, this theorem says that there is a polynomial time algorithm to order the data from the most unreliable $x_1$ to the most reliable $x_N$, so that if you are willing to throw out about k points of data to get a good fit with a smooth model, then an essentially optimal way to do this is simply to throw out the k most unreliable points $x_1,\ldots,x_k$. Charlie believes that the $O(N^2 \log N)$ amount of work can be reduced, but this is not yet been established rigorously.

One of the keys to these results is a remarkable reduction of the problem to a local one:

Finiteness theorem. Let n, m, E, $\sigma$ be as before, with E finite. Then there exists a constant C (depending only on n,m) such that

$\|f\|_{(E,\sigma)} \sim \max \{ \|f\|_{(S,\sigma)}: S \subset E, |S| \leq C \}$

for all functions $f: E \to {\Bbb R}$, where $X \sim Y$ means that X and Y are equal up to multiplicative constants (i.e. X = O(Y) and Y = O(X)).

Informally, this is saying that if you want to estimate the extent to which one can fit a smooth function to N points of data, it is enough to look at O(1) points of data at a time and see how well one can fit a smooth function to that. Since a data fitting problem for O(1) points can be done in O(1) time (for instance, by using the methods of the Whitney extension theorem), this already gives a polynomial time version of Theorem 1, since there are only polynomially many sets S of size O(1). To give a nearly linear-time version, one needs the following refinement, which cuts down the number of sets required to be linear in N:

Better finiteness theorem. Let n, m, E, $\sigma$ be as before, with E finite. Then there exists a sequence $S_1, \ldots, S_L$ of subsets of S of size O(1) and number $L = O( N )$, and computable using $O(N \log N)$ computations and $O(N)$ memory, such that

$\|f\|_{(E,\sigma)} \sim \max \{ \|f\|_{(S_l,\sigma)}: 1 \leq l \leq S \}$

for all functions $f: E \to {\Bbb R}$, where $X \sim Y$ means that X and Y are equal up to multiplicative constants (i.e. X = O(Y) and Y = O(X)).

Charlie emphasised that the finiteness of E was important in all of these results; one could not use compactness methods to obtain similar claims for infinite sets E, basically because the space $C^m({\Bbb R}^n)$ has no good compactness properties. (On the other hand, the slightly weaker space $C^{m-1,1}({\Bbb R}^n)$ is fine. For instance, it is well known the optimal Lipschitz (or $C^{0,1}$) norm $\|F\|_{Lip({\Bbb R}^n)}$ of an extension F of a function f defined on a possibly infinite set E is equal to the Lipschitz norm $\|f\|_{Lip(E)}$ of f itself, or equivalently to the supremum of the optimal Lipschitz norm of the extension of F from two points of f, where the supremum is taken over all two-point subsets of E.)

Indeed, the finiteness theorem is false for infinite sets E; a good example is when m = 1, $E = {\Bbb R}^n$, and $f(x) := |x|$. Every finite restriction of f can be extended in $C^1$ with norm arbitrarily close to 1, but f itself can only be extended to be Lipschitz rather than $C^1$.

Nevertheless, there are meaningful analogues of the above theorems for infinite sets. For this, Charlie set up some notation. Given a point $x_0$ and a regularity m, let ${\mathcal P}_{x_0} := {\Bbb R}[x] / ((x-x_0)^{m+1})$ denote the ring of polynomials in n variables $x = (x_1,\ldots,x_n)$, in which every polynomial of degree m+1 or higher in $x-x_0$ is neglected. In particular, note that there is a ring homomorphism from $C^m({\Bbb R}^n)$ to ${\mathcal P}_{x_0}$ that sends any m-times continuously differentiable function f to its order m Taylor expansion $J_{x_0}(F)$ at $x_0$ (the letter J stands for jet). Given a compact set E, define a bundle to be a collection $(H_x)_{x \in E}$, where for each $x \in E$, $H_x$ is either empty, or is the coset of an ideal of ${\mathcal P}_{x_0}$. A function $f\in C^m({\Bbb R}^n)$ is said to be a section of a bundle $(H_x)_{x \in E}$ if we have $J_x(f) \in H_x$ for all $x \in E$. Charlie then posed the following basic question:

Question. Fix n, m, E. Which bundles $(H_x)_{x \in E}$ admit sections?

The special case of this question in which $H_x := \{ P \in {\mathcal P}_x: P(x)=f(x)\}$ for some given function $f: E \to {\Bbb R}$ is the classic extension problem of Whitney.

Charlie promised a good answer to this question, but as time was running out for this talk, he left it as a cliffhanger for the next talk. So stay tuned…

[Update, Nov 14: typos fixed.]