This post is in some ways an antithesis of my previous postings on hard and soft analysis. In those posts, the emphasis was on taking a result in soft analysis and converting it into a hard analysis statement (making it more “quantitative” or “effective”); here we shall be focusing on the reverse procedure, in which one harnesses the power of infinitary mathematics – in particular, ultrafilters and nonstandard analysis – to facilitate the proof of finitary statements.

Arguments in hard analysis are notorious for their profusion of “epsilons and deltas”. In the more sophisticated arguments of this type, one can end up having an entire army of epsilons $\epsilon_1, \epsilon_2, \epsilon_3, \ldots$ that one needs to manage, in particular choosing each epsilon carefully to be sufficiently small compared to other parameters (including other epsilons), while of course avoiding an impossibly circular situation in which a parameter is ultimately required to be small with respect to itself, which is absurd. This art of epsilon management, once mastered, is not terribly difficult – it basically requires one to mentally keep track of which quantities are “small”, “very small”, “very very small”, and so forth – but when these arguments get particularly lengthy, then epsilon management can get rather tedious, and also has the effect of making these arguments unpleasant to read. In particular, any given assertion in hard analysis usually comes with a number of unsightly quantifiers (For every $\epsilon$ there exists an N…) which can require some thought for a reader to parse. This is in contrast with soft analysis, in which most of the quantifiers (and the epsilons) can be cleanly concealed via the deployment of some very useful terminology; consider for instance how many quantifiers and epsilons are hidden within, say, the Heine-Borel theorem: “a subset of a Euclidean space is compact if and only if it is closed and bounded”.

For those who practice hard analysis for a living (such as myself), it is natural to wonder if one can somehow “clean up” or “automate” all the epsilon management which one is required to do, and attain levels of elegance and conceptual clarity comparable to those in soft analysis, hopefully without sacrificing too much of the “elementary” or “finitary” nature of hard analysis in the process.

One important step in this direction has been the development of various types of asymptotic notation, such as the Hardy notation of using unspecified constants C, the Landau notation of using O() and o(), or the Vinogradov notation of using symbols such as $\ll$ or $\lesssim$; each of these symbols, when properly used, absorbs one or more of the ambient quantifiers in a hard analysis statement, thus making these statements easier to read. But, as useful as these notations are, they still fall a little short of fully capturing one’s intuition regarding orders of magnitude. For instance, we tend to think of any quantity of the form O(1) as being “bounded”, and we know that bounded objects can be combined to form more bounded objects; for instance, if x=O(1) and y=O(1), then x+y=O(1) and xy=O(1). But if we attempt to formalise this by trying to create the set $A := \{ x \in {\Bbb R}: x = O(1) \}$ of all bounded numbers, and asserting that this set is then closed under addition and multiplication, we are speaking nonsense; the O() notation cannot be used within the axiom schema of specification, and so the above definition of A is meaningless.

There is however, a way to make concepts such as “the set of all bounded numbers” precise and meaningful, by using non-standard analysis, which is the most well-known of the “pseudo-finitary” approaches to analysis, in which one adjoins additional numbers to the standard number system. Similarly for “bounded” replaced by “small”, “polynomial size”, etc.. Now, in order to set up non-standard analysis one needs a (non-principal) ultrafilter (or an equivalent gadget), which tends to deter people from wanting to hear more about the subject. Because of this, most treatments of non-standard analysis tend to gloss over the actual construction of non-standard number systems, and instead emphasise the various benefits that these systems offer, such as a rigorous supply of infinitesimals, and a general transfer principle that allows one to convert statements in standard analysis into equivalent ones in non-standard analysis. This transfer principle (which requires the ultrafilter to prove) is usually recommended to be applied only at the very beginning and at the very end of an argument, so that the bulk of the argument is carried out purely in the non-standard universe.

I feel that one of the reasons that non-standard analysis is not embraced more widely is because the transfer principle, and the ultrafilter that powers it, is often regarded as some sort of “black box” which mysteriously bestows some certificate of rigour on non-standard arguments used to prove standard theorems, while conveying no information whatsoever on what the quantitative bounds for such theorems should be. Without a proper understanding of this black box, a mathematician may then feel uncomfortable with any non-standard argument, no matter how impressive and powerful the result.

The purpose of this post is to try to explain this black box from a “hard analysis” perspective, so that one can comfortably and productively transfer into the non-standard universe whenever it becomes convenient to do so (in particular, it can become cost-effective to do this whenever the burden of epsilon management becomes excessive, and one is willing to not make certain implied constants explicit).

— What is an ultrafilter? —

In order to do all this, we have to tackle head-on the notorious concept of a non-principal ultrafilter. Actually, these ultrafilters are not as impossible to understand as their reputation suggests; they are basically a consistent set of rules which allow one to always take limits (or make similar decisions) whenever necessary.

To motivate them, let us recall some of the properties of convergent sequences from undergraduate real analysis. If $x_n$ is a convergent sequence of real numbers (where n ranges in the natural numbers), then we have a limit $\lim x_n$, which is also a real number. In addition to the usual analytical interpretations, we can also interpret the concept of a limit as a voting system, in which the natural numbers n are the voters, which are each voting for a real number $x_n$, and the limit $\lim x_n$ is the elected “winner” emerging from all of these votes. One can also view the limit (somewhat non-rigorously) as the expected value of $x_n$ when n is a “randomly chosen” natural number. Ignoring for now the issue that the natural numbers do not admit a uniform probability measure, it is intuitively clear that such a “randomly chosen” number is almost surely going to be larger than any fixed finite number, and so almost surely $x_n$ will be arbitrarily close to the limit $\lim x_n$ (thus we have a sort of “concentration of measure“).

These limits obey a number of laws, including

1. (Algebra homomorphism) If $x_n, y_n$ are convergent sequences, and c is a real number, then $\lim 1 = 1$, $\lim c x_n = c \lim x_n$, $\lim x_n + y_n = \lim x_n + \lim y_n$, and $\lim (x_n y_n) = (\lim x_n) (\lim y_n)$. (In particular, all sequences on the left-hand side are convergent.)
2. (Boundedness) If $x_n$ is a convergent sequence, then $\inf x_n \leq \lim x_n \leq \sup x_n$. (In particular, if $x_n$ is non-negative, then so is $\lim x_n$.)
3. (Non-principality) If $x_n$ and $y_n$ are convergent sequences which differ at only finitely many values of n, then $\lim x_n = \lim y_n$. [Thus, no individual voter has any influence on the outcome of the election!]
4. (Shift invariance) If $x_n$ is a convergent sequence, then for any natural number h we have $\lim x_{n+h} = \lim x_n$.

These properties are of course very useful in computing the limits of various convergent sequences. It is natural to wonder if it is possible to generalise the notion of a limit to cover various non-convergent sequences, such as the class $l^\infty({\Bbb N})$ of bounded sequences. There are of course many ways to do this in the literature (e.g. if one considers series instead of sequences, one has Cesàro summation, zeta function regularisation, etc.), but (as observed by Euler) one has to give up at least one of the above four limit laws if one wants to evaluate the limit of sequences such as $0,1,0,1,0,1,\ldots$. Indeed, if this sequence had a limit x, then the algebra homomorphism laws force $x^2 = x$ and thus x is either 0 or 1; on the other hand, the algebra homomorphism laws also show us that $1,0,1,0,\ldots$ has a limit 1-x, and hence by shift invariance we have x = 1-x, which is inconsistent with the previous discussion. In the voting theory interpretation, the problem here is one of lack of consensus: half of the voters want 0 and the other half want 1, and how can one consistently and fairly elect a choice from this? Similarly, in the probabilistic interpretation, there is no concentration of measure; a randomly chosen $x_n$ is not close to its expected value of 1/2, but instead fluctuates randomly between 0 and 1.

So, to define more general limits, we have to give up something. We shall give up shift-invariance (property 4). In the voting theory interpretation given earlier, this means that we abandon the pretense that the election is going to be “fair”; some voters (or groups of voters) are going to be treated differently than others, due to some arbitrary choices made in designing the voting system. (This is the first hint that the axiom of choice will be involved.) Similarly, in the probabilistic interpretation, we will give up the notion that the “random number” n we will choose has a shift-invariant distribution, thus for instance n could have a different distribution than n+1.

Suppose for the moment that we managed to have an improved concept of a limit which assigned a number, let’s call it $p\!-\!\lim x_n$, to any bounded sequence, which obeyed the properties 1-3. It is then easy to see that this p-limit extends the ordinary notion of a limit, because if a sequence $x_n$ is convergent, then after modifying the sequence on finitely many elements we can keep the sequence within $\epsilon$ of $\lim x_n$ for any specified $\epsilon > 0$, which implies (by properties 2, 3) that $p\!-\!\lim x_n$ stays within $\epsilon$ of $\lim x_n$, and the claim follows.

Now suppose we consider a Boolean sequence $x_n$ – one which takes only the values 0 and 1. Since $x_n^2 = x_n$ for all n, we see from property that $(p\!-\!\lim x_n)^2 = p\!-\!\lim x_n$, thus $p\!-\!\lim x_n$ must also be either 0 or 1. From a voting perspective, the p-limit is a voting system: a mechanism for extracting a yes-no answer out of the yes-no preferences of an infinite number of voters.

Let p denote the collection of all subsets A of the natural numbers such that the indicator sequence of A (i.e. the boolean sequence $x_n$ which equals 1 when n lies in A and equals 0 otherwise) has a p-limit of 1; in the voting theory language, p is the collection of all voting blocs who can decide the outcome of an election by voting in unison, while in the probability theory language, p is the collection of all sets of natural numbers of measure 1. It is easy to verify that p has four properties:

1. (Monotonicity) If A lies in p, and B contains A, then B lies in p.
2. (Closed under intersection) If A and B lie in p, then $A \cap B$ also lies in p.
3. (Dichotomy) If A is any set of natural numbers, either A or its complement lies in p, but not both.
4. (Non-principality) If one adds (or deletes) a finite number of elements to (or from) a set A, this does not affect whether the set A lies in p.

A collection p obeying properties 1, 2 is called a filter; a collection obeying 1,2,3 is called an ultrafilter, and a collection obeying 1,2,3,4 is a non-principal ultrafilter. [In contrast, a principal ultrafilter is one which is controlled by a single index $n_0$ in the sense that $p = \{ A: n_0 \in A \}$. In voting theory language, this is a scenario in which $n_0$ is a dictator; in probability language, the random variable n is now a deterministic variable taking the values of $n_0$.]

A property A(n) pertaining to a natural number n can be said to be p-true if the set $\{ n: A(n) \hbox{ true}\}$ lies in p, and p-false otherwise; for instance any tautologically true statement is also p-true. Using the probabilistic interpretation, these notions are analogous to those of “almost surely true” and “almost surely false” in probability theory. (Indeed, one can view p as being a probability measure on the natural numbers which always obeys a zero-one law, though one should caution that this measure is only finitely additive rather than countably additive, and so one should take some care in applying measure-theoretic technology directly to an ultrafilter.)

Properties 1-3 assert that this notion of “p-truth” obeys the usual laws of propositional logic, e.g. property 2 asserts that if A is p-true and B is p-true, then so is “A and B”, while property 3 is the familiar law of the excluded middle and property 1 is modus ponens. This is actually rather remarkable: it asserts that ultrafilter voting systems cannot create voting paradoxes, such as those guaranteed by Arrow’s theorem. There is no contradiction here, because Arrow’s theorem only applies to finite (hence compact) electorates of voters, which do not support any non-principal ultrafilters. At any rate, we now get a hint of why ultrafilters are such a useful concept in logic and model theory.

We have seen how the notion of a p-limit creates a non-principal ultrafilter p. Conversely, once one has a non-principal ultrafilter p, one can uniquely recover the p-limit operation. This is easiest to explain using the voting theory perspective. With the ultrafilter p, one can ask yes-no questions of an electorate, by getting each voter to answer yes or no and then seeing whether the resulting set of “yes” voters lies in p. To take a p-limit of a bounded sequence $x_n$, say in [0,1], what is going on is that each voter n has his or her own favourite candidate number $x_n$ between 0 and 1, and one has to elect a real number x from all these preferences. One can do this by an infinite electoral version of “Twenty Questions“: one asks all the voters whether x should be greater than 1/2 or not, and uses p to determine what the answer should be; then, if x is to be greater than 1/2, one asks whether x should be greater than 3/4, and so on and so forth. This eventually determines x uniquely; the properties 1-4 of the ultrafilter can be used to derive properties 1-3 of the p-limit.

A modification of the above argument also lets us take p-limits of any sequence in a compact metric space (or slightly more generally, in any compact Hausdorff first-countable topological space). These p-limits then behave in the expected manner with respect to operations in those categories, such as composition with continuous functions or with direct sum. As for unbounded real-valued sequences, one can still extract a p-limit as long as one works in a suitable compactification of the reals, such as the extended real line.

The reconstruction of p-limits from the ultrafilter p is also analogous to how, in probability theory, the concept of expected value of a (say) non-negative random variable X can be reconstructed from the concept of probability via the integration formula ${\Bbb E}(X) = \int_0^\infty {\Bbb P}(X \geq \lambda)\ d\lambda$. Indeed, one can define $p\!-\!\lim x_n$ to be the supremum of all numbers x such that the assertion $x_n > x$ is p-true, or the infimum of all numbers y that $x_n < y$ is p-true.

We have said all these wonderful things about non-principal ultrafilters, but we haven’t actually shown that these amazing objects actually exist. There is a good reason for this – the existence of non-principal ultrafilters requires the axiom of choice (or some slightly weaker versions of this axiom, such as the boolean prime ideal theorem). Let’s give two quick proofs of the existence of a non-principal ultrafilter:

Proof 1. Let q be the set of all cofinite subsets of the natural numbers (i.e. sets whose complement is finite). This is clearly a filter which is proper (i.e. it does not contain the empty set $\emptyset$). Since the union of any chain of proper filters is again a proper filter, we see from Zorn’s lemma that q is contained in a maximal proper filter p. It is not hard to see that p must then be a non-principal ultrafilter. $\Box$

Proof 2. Consider the Stone–Čech compactification $\beta {\Bbb N}$ of the natural numbers. Since ${\Bbb N}$ is not already compact, there exists an element p of this compactification which does not lie in ${\Bbb N}$. Now note that any bounded sequence $x_n$ on the natural numbers is a bounded continuous function on ${\Bbb N}$ (since ${\Bbb N}$ is discrete) and thus, by definition of $\beta {\Bbb N}$, extends uniquely to a bounded continuous function on $\beta {\Bbb N}$, in particular one can evaluate this function at p to obtain a real number $x_p$. If one then defines $p\!-\!\lim x_n := x_p$ one easily verifies the properties 1-4 of a p-limit, which by the above discussion creates a non-principal ultrafilter (which by abuse of notation is also referred to as p; indeed, $\beta {\Bbb N}$ is canonically identifiable with the space of all ultrafilters). $\Box$

These proofs are short, but not particularly illuminating. A more informal, but perhaps more instructive, explanation of why non-principal ultrafilters exist can be given as follows. In the voting theory language, our task is to design a complete and consistent voting system for an infinite number of voters. In the cases where there is near-consensus, in the sense that all but finitely many of the voters vote one way or another, the decision is clear – go with the option which is preferred by the infinite voting bloc. But what if an issue splits the electorate with an infinite number of voters on each side? Then what one has to do is make an arbitrary choice – pick one side to go with and completely disenfranchise all the voters on the other side, so that they will have no further say in any subsequent votes. By performing this disenfranchisement, we increase the total number of issues for which our electoral system can reach a consistent decision; basically, any issue which has the consensus of all but finitely many of those voters not yet disenfranchised can now be decided upon in a consistent (though highly unfair) manner. We now continue voting until we reach another issue which splits the remaining pool of voters into two infinite groups, at which point we have to make another arbitrary choice, and disenfranchise another infinite set of voters. Very roughly speaking, if one continues this process of making arbitrary choices “ad infinitum”, then at the end of this transfinite process we eventually exhaust the (uncountable) number of issues one has to decide, and one ends up with the non-principal ultrafilter. (If at any stage of the process one decided to disenfranchise all but finitely many of the voters, then one would quickly end up with a principal ultrafilter, i.e. a dictatorship.) One should take this informal argument with a grain of salt; it turns out that after one has made an infinite number of choices, the infinite number of disenfranchised groups, while individually having no further power to influence elections, can begin having some collective power, basically because property 2 of a filter only guarantees closure under finite intersections and not infinite intersections, and things begin to get rather complicated. At this point, I recommend abandoning the informal picture and returning to Zorn’s lemma :-) .

With this informal discussion, it is now rather clear why the axiom of choice (or something very much like that axiom) needs to play a role in constructing non-principal ultrafilters. However, one may wonder whether one really needs the full strength of an ultrafilter in applications; to return once again to the voting analogy, one usually does not need to vote on every single conceivable issue (of which there are uncountably many) in order to settle some problem; in practice, there are often only a countable or even finite number of tricky issues which one needs to put to the ultrafilter to decide upon. Because of this, many of the results in soft analysis which are proven using ultrafilters can instead be established using a “poor man’s non-standard analysis” (or “pre-infinitary analysis”) in which one simply does the “voter disenfranchisement” step mentioned above by hand. This step is more commonly referred to as the trick of “passing to a subsequence whenever necessary”, and is particularly popular in the soft analysis approach to PDE and calculus of variations. For instance, to minimise some functional, one might begin with a minimising sequence. This sequence might not converge in any reasonable topology, but it often lies in a sequentially compact set in some weak topology (e.g. by using the sequential version of the Banach-Alaoglu theorem), and so by passing to a subsequence one can force the sequence to converge in this topology. One can continue passing to a subsequence whenever necessary to force more and more types of convergence, and can even diagonalise using the Arzela-Ascoli argument to achieve a countable number of convergences at once (this is of course the sequential Banach-Alaoglu theorem in disguise); in many cases, one gets such a strong convergence that one can then pass to the limit. Most of these types of arguments could also be equivalently performed by selecting an ultrafilter p at the very beginning, and replacing the notions of limit by p-limit throughout; roughly speaking, the ultrafilter has performed all the subsequence-selection for you in advance, and all your sequences in compact spaces will automatically converge without the need to pass to any further subsequences. (For much the same reason, ultrafilters can be used to simplify a lot of infinitary Ramsey theory, as all the pigeonholing has been done for you in advance.) On the other hand, the “by hand” approach of selecting subsequences explicitly tends to be much more constructive (for instance, it can often be performed without any appeal to the axiom of choice), and can also be more easily converted to a quantitative “hard analysis” argument (for instance, by using the finite convergence principle discussed in earlier posts).

As a concrete example from my own experience, in one of my PDE papers with the “I-team” (Colliander, Keel, Staffilani, Takaoka, and myself), we had a rather severe epsilon management problem in our “hard analysis” arguments, requiring in fact seven very different small quantities $1 \gg \eta_0 \gg \ldots \gg \eta_6 > 0$, with each $\eta_i$ extremely small compared with the previous one. (As a consequence of this and of our inductive argument, our eventual bounds, while quantitative, were extremely large, requiring a nine-fold iterated Knuth arrow!) This epsilon management also led to the paper being unusually lengthy (85 pages). Subsequently, (inspired by a later paper of Kenig and Merle), I learnt how the use of the above “poor man’s non-standard analysis” could conceal almost all of these epsilons (indeed, due to concentration-compactness one can soon pass to a limiting object in which most of the epsilons get sent to zero). Partly because of this, a later paper of myself, Visan, and Zhang on a very similar topic, which adopted this softer approach, was significantly shorter (28 pages, although to be fair this paper also relies on an auxiliary 30-page paper), though to compensate for this it becomes much more difficult to extract any sort of quantitative bound from the argument.

For the purposes of non-standard analysis, one non-principal ultrafilter is much the same as any other. But it turns out that if one wants to perform additive operations on the index set n, then there is a special (and very useful) class of non-principal ultrafilters that one can use, namely the idempotent ultrafilters. These ultrafilters p almost recover the shift-invariance property (which, as remarked earlier, cannot be perfectly attained for ultrafilters) in the following sense: for p-almost all h, the ultrafilter p is equal to its translate p+h, or equivalently that $p\!-\!\lim_h p\!-\!\lim_n x_{n+h} = p\!-\!\lim_n x_n$ for all bounded sequences. (In the probability theory interpretation, in which p-limits are viewed as an expectation, this is analogous to saying that the probability measure associated to p is idempotent under convolution, hence the name). Such ultrafilters can, for instance, be used to give a short proof of Hindman’s theorem, which is otherwise rather unpleasant to prove. There are even more special ultrafilters known as minimal idempotent ultrafilters, which are quite useful in infinitary Ramsey theory, but these are now rather technical and I will refer the reader to a survey paper of Bergelson for details. I will note however one amusing feature of these objects; whereas “ordinary” non-principal ultrafilters require an application of Zorn’s lemma (or something similar) to construct them, these more special ultrafilters require multiple applications of Zorn’s lemma – i.e. a nested transfinite induction! Thus these objects are truly deep in the “infinitary” end of the finitary-infinitary spectrum of mathematics.

— Non-standard models —

We have now thoroughly discussed non-principal ultrafilters, interpreting them as voting systems which can extract a consistent series of decisions out of a countable number of independent voters. With this we can now discuss non-standard models of a mathematical system. There are a number of ways to build these models, but we shall stick to the most classical (and popular) construction.

Throughout this discussion we fix a single non-principal ultrafilter p. Now we make the following general definition.

Definition. Let X be any set. The ultrapower *X of X is defined to be the collection of all sequences $(x_n)$ with entries in X, modulo the equivalence that two sequences $(x_n), (y_n)$ are considered equal if they agree p-almost surely (i.e. the statement $x_n=y_n$ is p-true).

If X is a class of “standard” objects, we shall view *X as the corresponding class of “nonstandard” objects. Thus, for instance, ${\Bbb R}$ is the class of standard real numbers, and ${*}{\Bbb R}$ is the class of non-standard real (or hyperreal) numbers, with each non-standard real number being uniquely representable (up to p-almost sure equivalence) as an arbitrary sequence of standard real numbers (not necessarily convergent or even bounded). What one has done here is “democratised” the class X; instead of declaring a single object x in X that everyone has to work with, one allows each voter n in a countable electorate to pick his or her own object $x_n \in X$ arbitrarily, and the voting system p will then be used later to fashion a consensus as to the properties of these objects; this is why we can identify any two sets of voter choices which are p-almost surely identical. We shall abuse notation a little bit and use sequence notation $(x_n)$ to denote a non-standard element, even though strictly speaking one should deal with equivalence classes of sequences (just like how an element of an $L^p$ space is not, strictly speaking, a single function, but rather an equivalence class of functions that agree almost everywhere).

One can embed any class X of standard objects in its nonstandard counterpart *X, by identifying an element x with the constant sequence $x_n := x$; thus standard objects correspond to unanimous choices of the electorate. This identification is obviously injective. On the other hand, it is rather clear that *X is likely to be significantly larger than X itself.

Any operation or relation on a class (or several classes) of standard objects can be extended to the corresponding class(es) of nonstandard objects, simply by working pointwise on each n separately. For instance, the sum of two non-standard real numbers $(x_n)$ and $(y_n)$ is simply $(x_n+y_n)$; each voter in the electorate performs the relevant operation (in this case, addition) separately. Note that the fact that these sequences are only defined p-almost surely does not create any ambiguity. Similarly, we say that one non-standard number $(x_n)$ is less than another $(y_n)$, if the statement $x_n < y_n$ is p-true. And so forth. There is no direct interaction between different voters (which, in view of the lack of shift invariance, is a good thing); it is only through the voting system p that there is any connection at all between all of the individual voters.

For similar reasons, any property that one can define on a standard object, can also be defined on a non-standard object. For instance, a non-standard integer $m = (m_n)$ is prime iff the statement “$m_n$ is prime” is p-true; a non-standard function $f = (f_n)$ is continuous iff the statement “$f_n$ is continuous” is p-true; and so forth. Basically, if you want to know anything about a non-standard object, go put your question to all the voters, and then feed the answers into the ultrafilter p to get the answer to your question. The properties 1-4 (actually, just 1-3) of the ultrafilter ensure that you will always get a consistent answer out of this.

It is then intuitively obvious that any “simple” property that a class of standard objects has, will be automatically inherited by its nonstandard counterpart. For instance, since addition is associative in the standard real numbers, it will be associative in the non-standard real numbers. Since every non-zero standard real number is invertible in the standard real numbers, so is every non-zero non-standard real number (why?). Because Fermat’s last theorem is true for standard natural numbers, it is true for non-standard natural numbers (why?). And so forth. Now, what exactly does “simple” mean? Roughly speaking, any statement in first-order logic will transfer over from a standard class to a non-standard class, as long as the statement does not itself use p-dependent terms such as “standard” or “non-standard” anywhere. One could state a formal version of this principle here, but I find it easier just to work through examples such as the ones given above to get a sense of why this should be the case.

Now the opposite is also true; any statement in first-order logic, avoiding p-dependent terms such as standard and non-standard, which is true for non-standard classes of objects, is automatically true for standard classes also. This follows just from applying the above principle to the negation of the statement one is interested in. Suppose for instance that one has somehow managed to prove the twin prime conjecture for non-standard natural numbers. To see why this then implies the twin prime conjecture for standard natural numbers, we argue by contraduction. If the statement “the twin prime conjecture failed” was true for standard natural numbers, then it would also be true for non-standard natural numbers (it is instructive to work this out explicitly), a contradiction.

That’s the transfer principle in a nutshell; informally, everything which avoids p-dependent terminology and which is true in standard mathematics, is also true in non-standard mathematics, and vice versa. Thus the two models are syntactically equivalent, even if they are semantically rather different. So, if the two models of mathematics are equivalent, why bother working in the latter, which looks much more complicated? It is because in the non-standard model one acquires some additional useful adjectives, such as “standard”. Some of the objects in one’s classes are standard, and others are not. One can use this new adjective (and some others which we will define shortly) to perform manipulations in the non-standard universe which have no obvious counterpart in the standard universe. One can then hope to use those manipulations to eventually end up at a non-trivial new theorem in the standard world, either by arriving at a statement in the non-standard world which no longer uses adjectives such as “standard” and can thus be fed into the transfer principle, or else by using some other principles (such as the overspill principle) to convert a non-standard statement involving p-dependent adjectives into a standard statement. It’s similar to how, say, one can find a real root of a real polynomial by embedding the real numbers in the complex numbers, performing some mathematical manipulations in the complex domain, and then verifying that the complex-valued answer one gets is in fact real-valued.

Let’s give an example of a non-standard number. Let $\omega$ be the non-standard natural number (n), i.e. the sequence $0, 1, 2, 3, \ldots$ (up to p-almost sure equivalence, of course). This number is larger than any standard number; for instance, the standard number 5 corresponds to the sequence $5, 5, 5, \ldots$; since n exceeds 5 for all but finitely many values of n, we see that $n > 5$ is p-true and hence $\omega > 5$. More generally, let us say that a non-standard number is limited if its magnitude is bounded by a standard number, and unlimited otherwise; thus $\omega$ is unlimited. The notion of “limited” is analogous to the notion of being O(1) discussed earlier, but unlike the O() notation, there are no implicit quantifiers that require care to manipulate (though as we shall see shortly, the difficulty has not gone away completely).

One also sees, for instance, that $2\omega$ is larger than the sum of $\omega$ and any limited number, that $\omega^2$ is larger than the product of $\omega$ with any limited number, and so forth. It is also clear that the sum or product of any two limited numbers is limited. The number $1/\omega$ has magnitude smaller than any positive standard real number and is thus considered to be an infinitesimal. Using p-limits, we quickly verify that every limited number x can be uniquely expressed as the sum of a standard number st(x) and an infinitesimal number x-st(x). (The map $x \mapsto st(\log_\omega x)$, by the way, is a homomorphism from the semiring of non-standard positive reals to the tropical semiring $({\Bbb R}, \min, +)$, and thus encodes the correspondence principle between ordinary rings and tropical rings.) The set of standard numbers, the set of limited numbers, and the set of infinitesimal numbers are all subrings of the set of all non-standard numbers. A non-zero number is infinitesimal if and only if its reciprocal is unlimited.

Now at this point one might be suspicious that one is beginning to violate some of the axioms of the natural numbers or real numbers, in contradiction to the transfer principle alluded to earlier. For instance, the existence of unlimited non-standard natural numbers seems to contradict the well-ordering property: if one defines $S \subset {*}{\Bbb N}$ to be the set of all unlimited non-standard natural numbers, then this set is non-empty, and so the well-ordering property should then provide a minimal unlimited non-standard number $\inf(S) \in {*}{\Bbb N}$. But then $\inf(S)-1$ must be unlimited also, a contradiction. What’s the problem here?

The problem here is rather subtle: a set of non-standard natural numbers is not quite the same thing as a non-standard set of natural numbers. In symbols: if $2^X$ denotes the power set of X, then $2^{*{\Bbb N}} \not \equiv *(2^{\Bbb N})$. Let’s look more carefully. What is a non-standard set $A \in *(2^{\Bbb N})$ of natural numbers? This is basically a sequence $(A_n)$ of sets of natural numbers, one for each voter. Any given non-standard natural number $m = (m_n)$ may belong to A or not, depending on whether the statement $m_n \in A_n$ is p-true or not. We can collect all the non-standard numbers m which do belong in A, and call this set $\tilde A$; this is thus an element of $2^{*{\Bbb N}}$. The map $A \mapsto \tilde A$ from $*(2^{\Bbb N})$ to $2^{*{\Bbb N}}$ turns out to be injective (why? this is the transferred axiom of extensionality), but it is not surjective; there are some sets of non-standard natural numbers which are not non-standard sets of natural numbers, and as such the well-ordering principle, when transferred over from standard mathematics, does not apply to them. This subtlety is all rather confusing at first, but a good rule of thumb is that as long as your set (or function, or whatever) is not defined using p-dependent terminology such as “standard” or “limited”, it will be a non-standard set (or a non-standard function, etc.); otherwise it will merely be a set of non-standard objects (or a function from one non-standard set to another, etc.). [The situation here is similar to that with the adjective “constructive”; not every function from the constructive numbers to the constructive numbers is itself a constructive function, and so forth.]

It is worth comparing the situation here with that with the O() notation. With O(), the axiom schema of specification is simply inapplicable; one cannot form a set using O() notation inside the definition (though I must admit that I have occasionally been guilty of abusing notation and violating the above rule in my own papers). In non-standard analysis, in contrast, one can use terminology such as “limited” to create sets of non-standard objects, which then enjoy some useful structure (e.g. the set of limited numbers is a ring). It’s just that these sets are not themselves non-standard, and thus not subject to the transfer principle.

— Example: calculus via infinitesimals —

Historically, one of the original motivations of non-standard analysis was to make rigorous the manipulations of infinitesimals in calculus. While this is not the main focus of my post here, I will give just one small example of how non-standard analysis is applied in differential calculus. If x and y are two non-standard real numbers, with y positive, we write x=o(y) if x/y is infinitesimal. The key lemma is

Lemma 1. Let $f: {\Bbb R} \to {\Bbb R}$ be a standard function, and let x, L be standard real numbers. We identify f with a non-standard function in the usual manner. Then the following are equivalent:

1. f is differentiable at x with derivative f'(x) = L.
2. For any infinitesimal h, we have $f(x+h) = f(x) + h L + o(|h|)$.

Lemma 1 looks very similar to linear Taylor expansion, but note that there are no limits involved (despite the suggestive o() notation); instead, we have the concept of an infinitesimal. The implication of (2.) from (1.) follows easily from the definition of derivative, the transfer principle, and the fact that infinitesimals are smaller in magnitude than any positive standard real number. The implication of (1.) from (2.) can be seen by contradiction; if f is not differentiable at x with derivative L, then (by the axiom of choice) there exists a sequence $h_n$ of standard real numbers going to zero, such that the Newton quotient $(f(x+h_n)-f(x))/h_n$ is bounded away from L by a standard positive number. One now forms the non-standard infinitesimal $h = (h_n)$ and obtains a contradiction to (2.).

Using this equivalence, one can now readily deduce the usual laws of differential calculus, e.g. the chain rule, product rule, and mean value theorem; the proofs are algebraically almost identical to the usual proofs (especially if one rewrites those proofs in o() notation), but one does not need to deal explicitly with epsilons, deltas, and limits (the ultrafilter has in some sense already done all that for you). The epsilon management is done invisibly and automatically; one does not need to keep track of whether one has to choose epsilon first before selecting delta, or vice versa. In particular, most of the existential quantifiers (“… there exists $\epsilon$ such that …”) have been eliminated, leaving only the more pleasant universal quantifiers (“for every infinitesimal h …”).

There is one caveat though: Lemma 1 only works when x is standard. For instance, consider the standard function $f(x) := x^2 \sin(1/x^3)$, with the convention f(0)=0. This function is everywhere differentiable, and thus extending to non-standard numbers we have $f(x+h) = f(x) + h f'(x) + o(|h|^2)$ for all standard x and infinitesimal h. However, the same claim is not true for arbitrary non-standard x; consider for instance what happens if one sets x=-h.

One can also obtain an analogous characterisation of the Riemann integral: a standard function f is Riemann integrable on an interval [a,b] with integral A if and only if one has

$A = \sum_{1 \leq i < n} f(x^*_i) (x_{i+1}-x_i) + o(1)$

for any non-standard sequence

$a = x_1 \leq x_1^* \leq x_2 \leq \ldots \leq x_{n-1} \leq x_{n-1}^* \leq x_n = b$

with $\sup_{1 \leq i < n} (x_{i+1}-x_i)$ infinitesimal. One can then reprove the usual basic results, such as the fundamental theorem of calculus, in this manner; basically, the proofs are the same, but the limits have disappeared, being replaced by infinitesimals.

— Big O() notation —

Big O() notation in standard mathematics can be translated easily into the non-standard setting, as follows.

Lemma 2. Let $f: {\Bbb N} \to {\Bbb C}$ and $g: {\Bbb N} \to {\Bbb R}^+$ be standard functions (which can be identified with non-standard functions in the usual manner). Then the following are equivalent.

1. f(m) = O(g(m)) in the standard sense, i.e. there exists a standard positive real constant C such that $|f(m)| \leq C g(m)$ for all standard natural numbers n.
2. |f(m)|/g(m) is limited for every non-standard natural number m.

This lemma is proven similarly to Lemma 1; the implication of (2.) from (1.) is obvious from the transfer principle, while to the implication of (1.) from (2.) is again by contradiction, converting a sequence of increasingly bad counterexamples to (1.) to a counterexample to (2.). Lemma 2 is also a special case of the “overspill principle” in non-standard analysis, which asserts that a non-standard set of numbers which contains arbitrarily large standard numbers, must also contain an unlimited non-standard number (thus the large standard numbers “spill over” to contain some non-standard numbers). The proof of the overspill principle is related to the (specious) argument discussed above in which one tried to derive a contradiction from the set of unlimited natural numbers, and is left as an exercise.

[In some texts, the notation f = O(g) only requires that $|f(m)| \leq C g(m)$ for all sufficiently large m. The nonstandard counterpart to this is the claim that |f(m)|/g(m)| is limited for every unlimited non-standard m.]

Because of the above lemma, it is now natural to define the non-standard counterpart of the O() notation: if x, y are non-standard numbers with y positive, we say that x=O(y) if |x|/y is limited. Then the above lemma says that the standard and non-standard O() notations agree for standard functions of one variable. Note how the non-standard version of the O() notation does not have the existential quantifier (“… there exists C such that …”) and so the epsilon management is lessened. If we let ${\mathcal L}$ denote the subring of ${*}{\Bbb R}$ consisting of all limited numbers, then the claim x=y+O(z) can be rewritten as $x = y \hbox{ mod } z {\mathcal L}$, thus we see how the O() notation can be viewed algebraically as the operation of quotienting the (non-standard) real numbers by various dilates of the subring ${\mathcal L}$.

One can now convert many other order-of-magnitude notions to non-standard notation. For instance, suppose one is performing some standard hard analysis involving some large parameter $N > 1$, e.g. one might be studying a set of N points in some group or Euclidean space. One often wants to distinguish between quantities which are of polynomial size in N and those which are super-polynomial in size; for instance, these N points might lie in a finite group G, where G has size much larger than N, and one’s application is such that any bound which depends on the size of G will be worthless. Intuitively, the set of quantities which are of polynomial size in N should be closed under addition and multiplication and thus form a sort of subring of the real numbers, though in the standard universe this is difficult to formalise rigorously. But in nonstandard analysis, it is not difficult: we make N non-standard (and G too, in the above example), and declare any non-standard quantity x to be of polynomial size if we have $x = O(N^{O(1)})$, or equivalently if $\log x / \log N$ is limited. We can then legitimately form the set P of all non-standard numbers of polynomial size, and this is in fact a subring of the non-standard real numbers; as before, though, we caution that P is not a non-standard set of reals, and in particular is not a non-standard subring of the reals. Since P is a ring, one can then legitimately apply whatever results from ring theory one pleases to P, bearing in mind though that any sets of non-standard objects one generates using that theory may not necessarily be non-standard objects themselves. At the end of the day, we then use the transfer principle to go back to the original problem in which N is standard.

As a specific example of this type of thing from my own experience, in one of my papers with Van Vu, we had a large parameter n, and had at some point to introduce the somewhat fuzzy notion of a “highly rational number”, by which we meant a rational number a/b whose numerator and denominator were both at most $n^{o(n)}$ in magnitude. Such numbers looked like they were forming a field, since the sum, difference, product, or quotient of two highly rational numbers was again highly rational (but with a slightly different rate of decay in the o() notation). Intuitively, one should be able to do any algebraic manipulation on highly rational numbers which is legitimate for true fields (e.g. using Cramer’s rule to invert a non-singular matrix) and obtain an output which is also highly rational, as long as the number of algebraic operations one uses is O(1) rather than, say, O(n). We could not actually formalise this rigorously in our standard notation, and had to resort to informal English sentences to describe this; but one can do everything perfectly rigorously in the non-standard setting by letting n be non-standard, and defining the field F of non-standard rationals a/b where $a, b = O(n^{o(n)})$; F is genuinely a field of non-standard rationals (but not a non-standard field of rationals), and so using Cramer’s rule here (but only for matrices of standard size) would be perfectly legitimate. (We did not actually write our argument in this non-standard manner, keeping everything in the usual standard hard analysis setting, but it would not have been difficult to rewrite the argument non-standardly, and there would be some modest simplifications.)

— A hierarchy of infinitesimals —

We have seen how, by selecting an ultrafilter p, we can extend the standard real numbers ${\Bbb R}$ to a larger system ${*}{\Bbb R}$, in which the original number system ${\Bbb R}$ becomes a real totally ordered subfield. (Exercise: is ${\Bbb R}$ complete? The answer depends on how one defines one’s terms.) This gives us some new objects, such as the infinitesimal $\eta_0$ given by the sequence $1, 1/2, 1/4, 1/8, \ldots$. This quantity is smaller than any standard positive number, in particular it is infinitesimally smaller than any quantity depending (via standard operations) on standard constants such as 1. One may think of ${*}{\Bbb R}$ as the non-standard extension of ${\Bbb R}$ generated by adjoining $\eta_0$; this is similar to the field extension ${\Bbb R}(\eta_0)$, but is much larger, because field extensions are only closed under arithmetic operations, whereas non-standard extensions are closed under all definable operations. For instance, $\exp(1/\eta_0)$ lies in ${*}{\Bbb R}$ but not in ${\Bbb R}(\eta_0)$.

Now it is possible to iterate this process, by introducing an non-standard nonprincipal ultrafilter *p on the non-standard natural numbers ${*}{\Bbb N}$ (or more precisely, an ultrafilter on ${*}{\Bbb N}$ that extends an ultraproduct of nonprincipal ultrafilters on ${\Bbb N}$), and then embedding the field ${*}{\Bbb R}$ inside an even larger system ${*}{*}{\Bbb R}$, whose elements can be identified (modulo *p-almost sure equivalence) with non-standard sequences $(x_n)$ of non-standard numbers in ${*}{\Bbb R}$ (where n now ranges over the non-standard natural numbers ${*}{\Bbb N}$); one could view these as “doubly non-standard numbers”. This gives us some “even smaller” infinitesimals, such as the “doubly infinitesimal” number $\eta_1$ given by the non-standard sequence $1, \eta_0, \eta_0^2, \eta_0^3, \ldots$. This quantity is smaller than any standard or (singly) non-standard number, in particular infinitesimally smaller than any positive quantity depending (via standard or singly non-standard operations) on standard or singly non-standard constants such as 1 or $\eta_0$. For instance, it is smaller than $1/A(\lfloor 1/\eta_0\rfloor)$, where A is the Ackermann function, since the sequence that defines $\eta_1$ is indexed over the non-standard natural numbers and $\eta_0^n$ will drop below $1/A(\lfloor 1/\eta_0\rfloor)$ for sufficiently large non-standard n.

One can continue in this manner, creating a triply infinitesimal quantity $\eta_2$ which is infinitesimally smaller than anything depending on 1, $\eta_0$, or $\eta_1$, and so forth. Indeed one can iterate this construction an absurdly large number of times, though in most applications one only needs an explicitly finite number of elements from this hierarchy. Having this hierarchy of infinitesimals, each one of which is guaranteed to be infinitesimally small compared to any quantity formed from the preceding ones, is quite useful: it lets one avoid having to explicitly write a lot of epsilon-management phrases such as “Let $\eta_2$ be a small number (depending on $\eta_0$ and $\eta_1$) to be chosen later” and “… assuming $\eta_2$ was chosen sufficiently small depending on $\eta_0$ and $\eta_1$“, which are very frequent in hard analysis literature, particularly for complex arguments which involve more than one very small or very large quantity. (My I-team paper referred to earlier is of this type.)

— Conclusion —

I hope I have shown that non-standard analysis is not a totally “alien” piece of mathematics, and that it is basically only “one ultrafilter away” from standard analysis. Once one selects an ultrafilter, it is actually relatively easy to swap back and forth from the standard universe and the non-standard one (or to doubly non-standard universes, etc.). This allows one to rigorously manipulate things such as “the set of all small numbers”, or to rigorously say things like “$\eta_1$ is smaller than anything that involves $\eta_0$“, while greatly reducing epsilon management issues by automatically concealing many of the quantifiers in one’s argument. One has to take care as to which objects are standard, non-standard, sets of non-standard objects, etc., especially when transferring results between the standard and non-standard worlds, but as long as one is clearly aware of the underlying mechanism used to construct the non-standard universe and transfer back and forth (i.e. as long as one understands what an ultrafilter is), one can avoid difficulty. The main drawbacks to use of non-standard notation (apart from the fact that it tends to scare away some of your audience) is that a certain amount of notational setup is required at the beginning, and that the bounds one obtains at the end are rather ineffective (though, of course, one can always, after painful effort, translate a non-standard argument back into a messy but quantitative standard argument if one desires).

I plan to discuss some specific translations of some recent hard analysis results (e.g. the sum-product theory of Bourgain) to the non-standard analysis setting some time in the future, in the spirit of other translations such as van der Dreis and Wilkie’s non-standard proof of Gromov’s theorem on groups of polynomial growth.