This is going to be a somewhat experimental post. In class, I mentioned that when solving the type of homework problems encountered in a graduate real analysis course, there are really only about a dozen or so basic tricks and techniques that are used over and over again. But I had not thought to actually try to make these tricks explicit, so I am going to try to compile here a list of some of these techniques here. But this list is going to be far from exhaustive; perhaps if other recent students of real analysis would like to share their own methods, then I encourage you to do so in the comments (even – or especially – if the techniques are somewhat vague and general in nature).

(See also the Tricki for some general mathematical problem solving tips.  Once this page matures somewhat, I might migrate it to the Tricki.)

Note: the tricks occur here in no particular order, reflecting the stream-of-consciousness way in which they were arrived at.  Indeed, this list will be extended on occasion whenever I find another trick that can be added to this list.

1.  Split up equalities into inequalities.

If one has to show that two numerical quantities X and Y are equal, try proving that X \leq Y and Y \leq X separately.  (Often one of these will be very easy, and the other one harder; but the easy direction may still provide some clue as to what needs to be done to establish the other direction.)

In a similar spirit, to show that two sets E and F are equal, try proving that E \subset F and F \subset E.

2.  Give yourself an epsilon of room.

If one has to show that X \leq Y, try proving that X \leq Y+\varepsilon for any \varepsilon > 0. (This trick combines well with Trick 1.)

In a similar spirit, if one needs to show that a quantity X vanishes, try showing that |X| \leq \varepsilon for every \varepsilon > 0.

Or: if one wishes to show that two functions f, g agree almost everywhere, try showing first that |f(x)-g(x)| \leq \varepsilon holds for almost every x, or even just outside of a set of measure at most \varepsilon, for any given \varepsilon > 0.

Or: if one wants to show that a sequence x_n of real numbers converges to zero, try showing that \limsup_{n \to \infty} |x_n| \leq \varepsilon for every \varepsilon > 0.

Don’t be too focused on getting all your error terms adding up to exactly \varepsilon – usually, as long as the final error bound consists of terms that can all be made as small as one wishes by choosing parameters in a suitable way, that is enough.  For instance, an error term such as 10\varepsilon is certainly OK, or even more complicated expressions such as 10 \varepsilon/\delta + 4 \delta if one has the ability to choose \delta as small as one wishes, and then after \delta is chosen, one can then also set \varepsilon as small as one wishes (in a manner that can depend on \delta).

One caveat: for finite x, and any \varepsilon > 0, it is true that x+\varepsilon > x and x-\varepsilon < x, but this statement is not true when x is equal to +\infty (or -\infty).  So remember to exercise some care with the epsilon of room trick when some quantities are infinite.

See also the Tricki article on this topic.

3.  Decompose or approximate a rough or general object by a smoother or simpler one.

If one has to prove something about an unbounded (or infinite measure) set, consider proving it for bounded (or finite measure) sets first if this looks easier.


  1. if one has to prove something about a measurable set, try proving it for open, closed, compact, bounded, or elementary sets first.
  2. if one has to prove something about a measurable function, try proving it for functions that are continuous, bounded, compactly supported, simple, absolutely integrable, etc.
  3. if one has to prove something about an infinite sum or sequence, try proving it first for finite truncations of that sum or sequence (but try to get all the bounds independent of the number of terms in that truncation, so that you can still pass to the limit!).
  4. if one has to prove something about a complex-valued function, try it for real-valued functions first.
  5. If one has to prove something about a real-valued function, try it for unsigned functions first.
  6. If one has to prove something about a simple function, try it for indicator functions first.

In order to pass back to the general case from these special cases, one will have to somehow decompose the general object into a combination of special ones, or approximate general objects by special ones (or as a limit of a sequence of special objects).  In the latter case, one may need an epsilon of room (trick 2), and some sort of limiting analysis may be needed to deal with the errors in the approximation (it is not always enough to just “pass to the limit”, as one has to justify that the desirable properties of the approximating object are preserved in the limit).

Note: one should not do this blindly, as one might then be loading on a bunch of distracting but ultimately useless hypotheses that end up being a lot less help than one might hope.  But they should be kept in mind as something to try if one starts having thoughts such as “Gee, it would be nice at this point if I could assume that f is continuous / real-valued / simple / unsigned / etc. “.

In the more quantitative areas of analysis and PDE, one sees a common variant of the above technique, namely the method of a priori estimates.  Here, one needs to prove an estimate or inequality for all functions in a large, rough class (e.g. all rough solutions to a PDE).  One can often then first prove this inequality in a much smaller (but still “dense”) class of “nice” functions, so that there is little difficulty justifying the various manipulations (e.g. exchanging integrals, sums, or limits, or integrating by parts) that one wishes to perform.  Once one obtains these a priori estimates, one can then often take some sort of limiting argument to recover the general case.

4.  If one needs to flip an upper bound to a lower bound or vice versa, look for a way to take reflections or complements.

Sometimes one needs a lower bound for some quantity, but only has techniques that give upper bounds.  In some cases, though, one can “reflect” an upper bound into a lower bound (or vice versa) by replacing a set E contained in some space X with its complement X \backslash E, or a function f with its negation -f (or perhaps subtracting f from some dominating function F to obtain F-f).  This trick works best when the objects being reflected are contained in some sort of “bounded”, “finite measure”, or “absolutely integrable” container, so that one avoids having the dangerous situation of having to subtract infinite quantities from each other.

The triangle inequality

\| f \| \geq \|g \| - \| f-g\|

can be used to flip an upper bound on \|f-g\| to a lower bound on \|f\|, provided of course one has a lower bound on \|g\|.  Similarly, the Cauchy-Schwarz inequality

|\langle f,g \rangle| \leq \|f\| \|g\|

can flip a upper bound on \|g\| to a lower bound on \|f\|, provided that one has a lower bound on |\langle f, g \rangle|.  Holder’s inequality can also be used in a similar fashion.

5.  Uncountable unions can sometimes be replaced by countable or finite unions.

Uncountable unions are not well-behaved in measure theory; for instance, an uncountable union of null sets need not be a null set (or even a measurable set).  (On the other hand, the uncountable union of open sets remains open; this can often be important to know.)  However, in many cases one can replace an uncountable union by a countable one.  For instance, if one needs to prove a statement for all \varepsilon > 0, then there are an uncountable number of \varepsilon‘s one needs to check, which may threaten measurability; but in many cases it is enough to only work with a countable sequence of \varepsilons, such as the numbers 1/m for m=1,2,3,\ldots.

In a similar spirit, given a real parameter \lambda, this parameter initially ranges over uncountably many values, but in some cases one can get away with only working with a countable set of such values, such as the rationals.  In a similar spirit, rather than work with all boxes (of which there are uncountably many), one might work with the dyadic boxes (of which there are only countably many; also, they obey nicer nesting properties than general boxes and so are often desirable to work with in any event).

If you are working on a compact set, then one can often replace even uncountable unions with finite ones, so long as one is working with open sets.  When this option is available, it is often worth spending an epsilon of measure (or whatever other resource is available to spend) to make one’s sets open, just so that one can take advantage of compactness.

6.  If it is difficult to work globally, work locally instead.

A domain such as Euclidean space {\bf R}^d has infinite measure, and this creates a number of technical difficulties when trying to do measure theory directly on such spaces.  Sometimes it is best to work more locally, for instance working on a large ball B(0,R) or even a small ball such as B(x,\varepsilon) first, and then figuring out how to patch things together later.  Compactness (or the closely related property of total boundedness) is often useful for patching together small balls to cover a large ball.  Patching together large balls into the whole space tends to work well when the properties one are trying to establish are local in nature (such as continuity, or pointwise convergence) or behave well with respect to countable unions.  For instance, to prove that a sequence of functions f_n converges pointwise almost everywhere to f on {\bf R}^d, it suffices to verify this pointwise almost everywhere convergence on the ball B(0,R) for every R > 0 (which one can take to be an integer to get countability, see Trick 5).

7.  Be willing to throw away an exceptional set.

The “Lebesgue philosophy” to measure theory is that null sets are often “irrelevant”, and so one should be very willing to cut out a set of measure zero on which bad things are happening (e.g. a function is undefined or infinite, a sequence of functions is not converging, etc.).  One should also be only slightly less willing to throw away sets of positive but small measure, e.g. sets of measure at most \varepsilon.  If such sets can be made arbitrarily small in measure, this is often almost as good as just throwing away a null set.

Many things in measure theory improve after throwing away a small set.  The most notable examples of this are Egorov’s theorem (pointwise a.e. convergence becomes locally uniform convergence after throwing away a small set) and Lusin’s theorem (measurable functions become continuous after throwing away a small set).  A related (and simpler) principle is that a measurable function f that is finite almost everywhere can become locally bounded after throwing away a small set (this can be seen by applying downward monotone convergence to exceptional sets such as \{ x \in B(0,R): |f(x)| \geq N \} as N \to \infty).  From Markov’s inequality, one also sees that a function that is small in L^1 norm becomes small uniformly after throwing away a small set.  (The same is true for functions that are small in other L^p norms, or (by definition) for functions that are converging to zero in measure.)

A little later in this course, we will also start seeing a similar trick of throwing away most of a sequence and working with a subsequence instead.  (This trick can also be interpreted as “throwing away a small set”, but to understand what “small” means in this context, one needs the language of ultrafilters, which will not be discussed here.)  See Trick 17 below.

8.  Draw pictures and try to build counterexamples.

Measure theory, particularly on Euclidean spaces, has a significant geometric aspect to it, and you should be exploiting your geometric intuition.  Drawing pictures and graphs of all the objects being studied is a good way to start.  These pictures need not be completely realistic; they should be just complicated enough to hint at the complexities of the problem, but not more.  (For instance, usually one- or two-dimensional pictures suffice for understanding problems in {\bf R}^d; drawing intricate 3D (or 4D, etc.) pictures does not often make things simpler.  To indicate that a function is not continuous, one or two discontinuities or oscillations might suffice; make it too ornate and it becomes less clear what to do about that function.)

A common mistake is to try to draw a picture in which both the hypotheses and conclusion of the problem hold.  This is actually not all that useful, as it often does not reveal the causal relationship between the former and the latter.  One should try instead to draw a picture in which the hypotheses hold but for which the conclusion does not – in other words, a counterexample to the problem.  Of course, you should be expected to fail at this task, given that the statement of the problem is presumably true. However, the way in which your picture fails to accomplish this task is often very instructive, and can reveal vital clues as to how the solution to the problem is supposed to proceed.

9.  Try simpler cases first.

This advice of course extends well beyond measure theory, but if one is completely stuck on a problem, try making the problem simpler (while still capturing at least one of the difficulties of the problem that you cannot currently resolve).  For instance, if faced with a problem in {\bf R}^d, try the one-dimensional case d=1 first.  Faced with a problem about a general measurable function f, try a much simpler case first, such as an indicator function f = 1_E.  Faced with a problem about a general measurable set, try an elementary set first.  Faced with a problem about a sequence of functions, try a monotone sequence of functions first.  And so forth.  (Note that this trick overlaps quite a bit with Trick 3.)

The problem should not be made so simple that it becomes trivial, as this doesn’t really gain you any new insight about the original problem; instead, one should try to keep the “essential” difficulties of the problem while throwing away those aspects that you think are less important (but are still serving to add to the overall difficulty level).

On the other hand, if the simplified problem is unexpectedly easy, but one cannot extend the methods to the general case (or somehow leverage the simplified case to the general case, as in Trick 3), this is an indication that the true difficulty lies elsewhere.  For instance, if a problem involving general functions could be solved easily for monotone functions, but one cannot then extend that argument to the general case, this suggests that the true enemy is oscillation, and perhaps one should try another simple case in which the function is allowed to be highly oscillatory (but perhaps simple in other ways, e.g. bounded with compact support).

10.  Abstract away any information that you believe or suspect to be irrelevant.

Sometimes one is faced with an embarrassment of riches when it comes to what choice of technique to use on a problem; there are so many different facts that one knows about the problem, and so many different pieces of theory that one could apply, that one doesn’t quite know where to begin.

When this happens, abstraction can be a vital tool to clear away some of the conceptual clutter.  Here, one wants to “forget” part of the setting that the problem is phrased in, and only keep the part that seems to be most relevant to the hypotheses and conclusions of the problem (and thus, presumably, to the solution as well).

For instance, if one is working in a problem that is set in Euclidean space {\bf R}^d, but the hypotheses and conclusions only involve measure-theoretic concepts (e.g. measurability, integrability, measure, etc.) rather than topological structure, metric structure, etc., then it may be worthwhile to try abstracting the problem to the more general setting of an abstract measure space, thus forgetting that one was initially working in {\bf R}^d.   The point of doing this is that it cuts down on the number of possible ways to start attacking the problem.  For instance, facts such as outer regularity (every measurable set can be approximated from above by an open set) do not hold in abstract measure spaces (which do not even have a meaningful notion of an open set), and so presumably will not play a role in the solution; similarly for any facts involving boxes.  Instead, one should be trying to use general facts about measure, such as countable additivity, which are not specific to {\bf R}^d.

[It is worth noting that sometimes this abstraction method does not always work; for instance, when viewed as a measure space, {\bf R}^d is not completely arbitrary, but does have one or two features that distinguish it from a generic measure space, most notably the fact that it is \sigma-finite.  So, even if the hypotheses and conclusion of a problem about {\bf R}^d is purely measure-theoretic in nature, one may still need to use some measure-theoretic facts specific to {\bf R}^d.  Here, it becomes useful to know a little bit about the classification of measure spaces to have some intuition as to how “generic” a measure space such as {\bf R}^d really is.  This intuition is hard to convey at this level of the subject, but in general, measure spaces form a very “non-rigid” category, with very few invariants, and so it is largely true that one measure space usually behaves much the same as any other.]

Another example of abstraction: suppose that a problem involves a large number of sets (e.g. E_n and F_n) and their measures, but that the conclusion of the problem only involves the measures m(E_n), m(F_n) of the sets, rather than the sets themselves.  Then one can try to abstract the sets out of the problem, by trying to write down every single relationship between the numerical quantities m(E_n), m(F_n) that one can easily deduce from the given hypotheses (together with basic properties of measure, such as monotonicity or countable additivity).  One can then rename these quantities (e.g. a_n := m(E_n) and b_n := m(F_n)) to “forget” that these quantities arose from a measure-theoretic context, and then work with a purely numerical problem, in which one is starting with hypotheses on some sequences a_n, b_n of numbers and trying to deduce a conclusion about such sequences.  Such a problem is often easier to solve than the original problem due to the simpler context.  Sometimes, this simplified problem will end up being false, but the counterexample will often be instructive, either in indicating the need to add an additional hypothesis connecting the a_n, b_n, or to indicate that one cannot work at this level of abstraction but must introduce some additional concrete ingredient.

A real-life example: a student once asked me how to justify the claim that

\displaystyle e^{\pi i\lambda^{-1}\langle T^{-1}\xi,\xi\rangle} =\sum_{j=0}^N\frac{(\pi i\lambda^{-1}\langle T^{-1}\xi,\xi\rangle)^j}{j!} +O(\frac{|\xi|^{2N+2}}{\lambda^{N+1}})

uniformly for \xi \in {\bf R}^d and \lambda > 0, where T: {\bf R}^d \to {\bf R}^d is a fixed linear operator and N is a fixed natural number.  This expression looks very complicated, but if one observes that the expression x := \pi \lambda^{-1}\langle T^{-1}\xi,\xi\rangle is common to both sides, and that this expression is of magnitude O( |\xi|^2 / \lambda ), one sees that actually one can abstract this claim to the simpler claim

\displaystyle e^{ix} = \sum_{j=0}^N \frac{(ix)^j}{j!} + O( x^{N+1} ).

At this point it is clear what to do: use Taylor’s theorem with remainder.

Note that this trick is in many ways the antithesis of Trick 9, because by passing to a special case, one often makes the problem more concrete, with more things that one is now able to start trying.  However, the two tricks can work together.  One particularly useful “advanced move” in mathematical problem solving is to first abstract the problem to a more general one, and then consider a special case of that more abstract problem which is not directly related to the original one, but is somehow simpler than the original while still capturing some of the essence of the difficulty.   Attacking this alternate problem can then lead to some indirect but important ways to make progress on the original problem.

11.  Exploit Zeno’s paradox: a single epsilon can be cut up into countably many sub-epsilons.

A particularly useful fact in measure theory is that one can cut up a single epsilon into countably many pieces, for instance by using the geometric series identity

\varepsilon = \varepsilon/2 + \varepsilon/4 + \varepsilon/8 + \ldots;

this observation arguably goes all the way back to Zeno.  As such, even if one only has an epsilon of room budgeted for a problem, one can still use this budget to do a countably infinite number of things.  This fact underlies many of the countable additivity and subadditivity properties in measure theory, and also makes the ability to approximate rough objects by smoother ones to be useful even when countably many rough objects need to be approximated.

In general, one should be alert to this sort of trick when one has to spend an epsilon or so on an infinite number of objects.  If one was forced to spend the same epsilon on each object, one would soon end up with an unacceptable loss; but if one can get away with using a different epsilon each time, then Zeno’s trick comes in very handy.

Needless to say, this trick combines well with Trick 2.

12.  If you expand your way to a double sum, a double integral, a sum of an integral, or an integral of a sum, try interchanging the two operations.

Or, to put it another way: “The Fubini-Tonelli theorem is your friend”.  Provided that one is either in the unsigned or absolutely convergent worlds, this theorem allows you to interchange sums and integrals with each other.  In many cases, a double sum or integral that is difficult to sum in one direction can become easier to sum (or at least to upper bound, which is often all that one needs in analysis).  In fact, if in the course of expanding an expression, you encounter such a double sum or integral, you should reflexively try interchanging the operations to see if the resulting expression looks any simpler.

Note that in some cases the parameters in the summation may be constrained, and one may have to take a little care to sum it properly.  For instance,

\sum_{n=-\infty}^\infty \sum_{m=n}^\infty a_{m,n} (1)

interchanges (assuming that the a_{n,m} are either unsigned or absolutely convergent) to

\sum_{m=-\infty}^\infty \sum_{n=-\infty}^m a_{m,n}

(why? try plotting the set of pairs (m,n) that appear in both).  If one is having trouble interchanging constrained sums or integrals, one solution is to re-express the constraint using indicator functions. For instance, one can rewrite the constrained sum (1) as the unconstrained sum

\sum_{n=-\infty}^\infty \sum_{m=-\infty}^\infty 1_{m \geq n} a_{m,n}

(extending the domain of a_{m,n} if necessary), at which point interchanging the summations is easily accomplished.

The following point is obvious, but bears mentioning explicitly: while the interchanging sums/integrals trick can be very powerful, one should not apply it twice in a row to the same double sum or double operation, unless one is doing something genuinely non-trivial in between those two applications.  So, after one exchanges a sum or integral, the next move should be something other than another exchange (unless one is dealing with a triple or higher sum or integral).

A related move (not so commonly used in measure theory, but occurring in other areas of analysis, particularly those involving the geometry of Euclidean spaces) is to merge two sums or integrals into a single sum or integral over the product space, in order to use some additional feature of the product space (e.g. rotation symmetry) that is not readily visible in the factor spaces.  The classic example of this trick is the evaluation of the gaussian integral \int_{-\infty}^\infty e^{-x^2}\ dx by squaring it, rewriting that square as the two-dimensional gaussian integral \int_{{\bf R}^2} e^{-x^2-y^2}\ dx dy, and then switching to polar coordinates.

13.  Pointwise control, uniform control, and integrated (average) control are all partially convertible to each other.

There are three main ways to control functions (or sequences of functions, etc.) in analysis.  Firstly, there is pointwise control, in which one can control the function at every point (or almost every point), but in a non-uniform way.  Then there is uniform control, where one can control the function in the same way at most points (possibly throwing out a set of zero measure, or small measure).  Finally, there is integrated control (or control “on the average”), in which one controls the integral of a function, rather than the pointwise values of that function.

It is important to realise that control of one type can often be partially converted to another type.  Simple examples include the deduction of pointwise convergence from uniform convergence, or integrating a pointwise bound f(x) \leq g(x) to obtain an integrated bound \int f \leq \int g.  Of course, these conversions are not reversible and thus lose some information; not every pointwise convergent sequence is uniformly convergent, and an integral bound does not imply a pointwise bound.  However, one can partially reverse such implications if one is willing to throw away an exceptional set (Trick 7).  For instance, Egorov’s theorem lets one convert pointwise convergence to (local) uniform convergence after throwing away an exceptional set, and Markov’s inequality lets one convert integral bounds to pointwise bounds, again after throwing away an exceptional set.

14.  If the conclusion and hypotheses look particularly close to each other, just expand out all the definitions and follow your nose.

This trick is particularly useful when building the most basic foundations of a theory.  Here, one may not need to experiment too much with generalisations, abstractions, or special cases, or try to marshall a lot of possibly relevant facts about the objects being studied: sometimes, all one has to do is go back to first principles, write out all the definitions with their epsilons and deltas, and start plugging away at the problem.

Knowing when to just follow one’s nose, and when to instead look for a more high-level approach to a problem, can require some judgement or experience.  A direct approach tends to work best when the conclusion and hypothesis already look quite similar to each other (e.g. they both state that a certain set or family of sets is measurable, or they both state that a certain function or family of functions is continuous, etc.).  But when the conclusion looks quite different from the hypotheses (e.g. the conclusion is some sort of integral identity, and the hypotheses involve measurability or convergence properties), then one may need to use more sophisticated tools than what one can easily get from using first principles.

15.  Don’t worry too much about exactly what \varepsilon (or \delta, or N, etc.) needs to be.  It can usually be chosen or tweaked later if necessary.

Often in the middle of an argument, you will want to use some fact that involves a parameter, such as \varepsilon, that you are completely free to choose (subject of course to reasonable constraints, such as requiring \varepsilon to be positive).  For instance, you may have a measurable set and decide to approximate it from above by an open set of at most \varepsilon more measure.  But it may not be obvious exactly what value to give this parameter \varepsilon; you have so many choices available that you don’t know which one to pick!

In many cases, one can postpone thinking about this problem by leaving \varepsilon undetermined for now, and continuing on with one’s argument, which will gradually start being decorated with \varepsilon‘s all over the place.  At some point, one will need \varepsilon to do something (and, in the particular case of \varepsilon, “doing something” almost always means “being small enough”), e.g. one may need 3n\varepsilon to be less than \delta, where n, \delta are some other positive quantities in one’s problem that do not depend on \varepsilon.  At this point, one could now set \varepsilon to be whatever is needed to get past this step in the argument, e.g. one could set \varepsilon to equal \delta/4n.  But perhaps one still wishes to retain the freedom to set \varepsilon because it might come in handy later.  In that case, one sets aside the requirement “3n \varepsilon < \delta” and keeps going.  Perhaps a bit later on, one might need \varepsilon to do something else; for instance, one might also need 5\varepsilon \leq 2^{-n}.  Once one has compiled the complete “wish list” of everything one wishes one’s parameters to do, then one can finally make the decision of what value to set those parameters equal to.  For instance, if the above two inequalities are the only inequalities required of \varepsilon, one can choose \varepsilon equal to \min( \delta/4n, 2^{-n}/5).  This may be a choice of \varepsilon which was not obvious at the start of the argument, but becomes so as the argument progresses.

There is however one big caveat when adopting this “choose parameters later” approach, which is that one needs to avoid a circular dependence of constants.  For instance, it is perfectly fine to have two arbitrary parameters \varepsilon and \delta floating around unspecified for most of the argument, until at some point you realise that you need \varepsilon to be smaller than \delta, and so one chooses \varepsilon accordingly (e.g. one sets it to equal \delta/2).  Or, perhaps instead one needs \delta to be smaller than \varepsilon, and so sets \delta equal to \varepsilon/2.  One can execute either of these two choices separately, but of course one cannot perform them simultaneously; this sets up an inconsistent circular dependency in which \varepsilon needs to be defined after \delta is chosen, and \delta can only be chosen after \varepsilon is fixed.  So, if one is going to delay choosing a parameter such as \varepsilon until later, it becomes important to mentally keep track of what objects in one’s argument depend on \varepsilon, and which ones are independent of \varepsilon.  One can choose \varepsilon in terms of the latter quantities, but one usually cannot do so in terms of the former quantities (unless one takes the care to show that the interlinked constraints between the quantities are still consistent, and thus simultaneously satisfiable).

See also the Tricki article “Keep parameters unspecified until it is clear how to optimize them“.

16.  Once one has started to lose some constants (or some epsilons), don’t be hesitant to lose some more.

Many techniques in analysis end up giving inequalities that are inefficient by a constant factor.  For instance, any argument involving dyadic decomposition and powers of two tends to involve losses of factors of 2.  When arguing using balls in Euclidean space, one sometimes loses factors involving the volume of the unit ball (although this factor often cancels itself out if one tracks it more carefully).  And so forth.  However, in many cases these constant factors end up being of little importance: an upper bound of 2\varepsilon or 100\varepsilon is often just as good as an upper bound of \varepsilon for the purposes of analysis (cf. Trick 15).  So it is often best not to invest too much energy in carefully computing and optimising these constants; giving these constants a symbol such as C, and not worrying about their exact value, is often the simplest approach.  (One can also use asymptotic notation, such as O(), which is very convenient to use once you know how it works.)

Now there are some cases in which one really does not want to lose any constants at all.  For instance, if one is using Trick 1 to prove that X=Y, it is not enough to show that X \leq 2Y and Y \leq 2X; one really needs to show X \leq Y and Y \leq X without losing any constants.  (But proving X \leq (1+\varepsilon)Y and Y \leq (1+\varepsilon)X for every \varepsilon > 0 is OK, by Trick 2.)  But once one has already performed one step that loses a constant, there is little further to be lost by losing more; there can be a big difference between X \leq Y and X \leq 2Y, but there is little difference in practice between X \leq 2Y and X \leq 100Y, at least for the purposes of mathematical analysis.  At that stage, one should put oneself in the mental mode of thought where “constants don’t matter”, which can lead to some simplifications.  For instance, if one has to estimate a sum X+Y of two positive quantities, one can start using such estimates as

\max(X,Y) \leq X+Y \leq 2 \max(X,Y),

which says that, up to afactor of 2, X+Y is the same thing as \max(X,Y).  In some cases the latter is easier to work with (e.g. \max(X,Y)^n is equal to \max(X^n,Y^n), whereas the formula for (X+Y)^n is messier).

A related principle: if one is in the common situation where one can afford to have some error terms, then often approximate formulae can be more useful than exact formulae.  For instance, a Taylor expansion with remainder such as f(x_0+h) = f(x_0) + f'(x_0) h + \frac{1}{2} f''(x_0) h^2 + O(h^3) can often be more useful than the full Taylor series f(x_0+h) = \sum_{n=0}^\infty \frac{1}{n!} f^{(n)}(x_0) h^n.  As a general rule of thumb, once one has a formula that is accurate up to the order of magnitude of error one is willing to tolerate, it is usually not very productive to try to refine the formula much further to improve the accuracy.

In a similar spirit, if one wants to optimise some expression (such as \varepsilon \lambda e^\lambda + 1/\lambda) in a parameter such as \lambda > 0 (where in this example \varepsilon > 0 is a small quantity), and one is willing to lose a constant factor, then it is often not necessary to perform the undergraduate calculus steps of differentiating in the parameter and setting the derivative equal to zero; one can instead just test a few values of the parameter \lambda and see what produces a near-optimal result.  For instance, setting any \lambda larger than \log \frac{1}{\varepsilon} will give a terrible first term, but setting \lambda = \log \frac{1}{\varepsilon} gives a first term that is only of size \log \frac{1}{\varepsilon}, and a second term that is the smaller size 1/\log \frac{1}{\varepsilon}.  To balance this one can instead set \lambda = \log \frac{1}{\varepsilon} - 2 \log\log \frac{1}{\varepsilon}, and now one sees that both terms are of size O( 1/\log \frac{1}{\varepsilon} ).  One can also easily show that it is not possible for both terms to simultaneously be significantly smaller than O( 1/\log \frac{1}{\varepsilon} ). so the optimal bound here is indeed O( 1/\log \frac{1}{\varepsilon} ).  (Now try to arrive at the same conclusion using undergraduate calculus.)

17.  One can often pass to a subsequence to improve the convergence properties.

In real analysis, one often ends up possessing a sequence of objects, such as a sequence of functions f_n, which may converge in some rather slow or weak fashion to a limit f.  Often, one can improve the convergence of this sequence by passing to a subsequence.  For instance:

  1. In a metric space, if a sequence x_n converges to a limit x, then one can find a subsequence x_{n_j} which converges quickly to the same limit x, for instance one can ensure that d(x_{n_j},x) \leq 2^{-j} (or one can replace 2^{-j} with any other positive expression depending on j).  In particular, one can make \sum_{j=1}^\infty d(x_{n_j},x) and \sum_{j=1}^\infty d(x_{n_j},x_{n_{j+1}}) absolutely convergent, which is sometimes useful.
  2. A sequence of functions that converges in L^1 norm or in measure can be refined to a subsequence that converges pointwise almost everywhere as well.
  3. A sequence in a (sequentially) compact space may not converge at all, but some subsequence of it will always converge.
  4. The pigeonhole principle: A sequence which takes only finitely many values has a subsequence that is constant.  More generally, a sequence which lives in the union of finitely many sets has a subsequence that lives in just one of these sets.

Often, the subsequence is good enough for one’s applications, and there are also a number of ways to get back from a subsequence to the original sequence:

  1. In a metric space, if you know that x_n is a Cauchy sequence, and some subsequence of x_n already converges to x, then this drags the entire sequence with it, i.e. x_n converges to x also.
  2. The Urysohn subsequence principle: in a topological space, if every subsequence of a sequence x_n itself has a subsequence that converges to a limit x, then the entire sequence converges to x.

18.  A real limit can be viewed as a meeting of the limit superior and limit inferior.

A sequence x_n of real numbers does not necessarily have a limit \lim_{n \to \infty} x_n, but the limit superior \limsup_{n \to \infty} x_n := \inf_N \sup_{n>N} x_n and the limit inferior \liminf_{n \to \infty} x_n = \sup_N \inf_{n>N} x_n always exist (though they may be infinite), and can be easily defined in terms of infima and suprema.  Because of this, it is often convenient to work with the lim sup and lim inf instead of a limit.  For instance, to show that a limit \lim_{n \to \infty} x_n exists, it suffices to show that

\limsup_{n \to \infty} x_n \leq \liminf_{n \to \infty} x_n + \varepsilon

for all \varepsilon > 0.   In a similar spirit, to show that a sequence x_n of real numbers converges to zero, it suffices to show that

\limsup_{n \to \infty} |x_n| \leq \varepsilon

for all \varepsilon > 0.  It can be more convenient to work with lim sups and lim infs instead of limits because one does not need to worry about the issue of whether the limit exists or not, and many tools (notably Fatou’s lemma and its relatives) still work in this setting.  One should however be cautious that lim sups and lim infs tend to have only one half of the linearity properties that limits do; for instance, lim sups are subadditive but not necessarily additive, while lim infs are superadditive but not necessarily additive.

19.  Be aware of and exploit the symmetries of the problem, for instance to normalise a key mathematical object, or to locate instructive limiting cases.

In many problems in analysis enjoy some sort of symmetry (such as translation symmetry or homogeneity) in which the truth of the statement to be proven is clearly unchanged (or nearly unchanged) if one applies this symmetry to the parameters of the problem.  For instance, if one wishes to establish an estimate of the form

\| Tf \|_Y \leq C \|f\|_X

for all f \in X, then one can notice that this estimate enjoys a homogeneity symmetry, in that one can multiply f by an arbitrary non-zero constant c without changing the truth value of the above estimate, since the inequality

\| T(cf) \|_Y \leq C \|cf\|_X

is clearly equivalent to the original inequality.

Being aware of such symmetries can have many advantages.   Most obviously, one can spend such symmetries to achieve a convenient normalisation; for instance, in the above problem, one can spend the homogeneity symmetry to obtain the normalisation \|f\|_X = 1 (in some rare cases, \|Tf\|_Y = 1 would be a more convenient normalisation).    Once one enforces such a normalisation, though, the symmetry is “spent” and cannot be simultaneously used to achieve a different normalisation that is incompatible with the previous one (for instance one cannot simultaneously normalise \|f\|_X=1 and \|Tf\|_Y=1).  It is thus desirable to collect as many symmetries of the problem as one can, so that one can normalise as much as possible (but one has to check that spending one symmetry to achieve a normalisation does not destroy the other symmetries one is planning to spend).  For instance, to prove Holder’s inequality

\|fg\|_{L^r(X,\mu)} \leq \|f\|_{L^p(X,\mu)} \|g\|_{L^q(X,\mu)}

one can use the homogeneity in f and g separately to simultaneously achieve the normalisations \|f\|_{L^p(X,\mu)}=\|g\|_{L^q(X,\mu)}=1; but one can also use the symmetry f \mapsto |f|, g \mapsto |g| to make f,g non-negative, and also the symmetry f \mapsto |f|^c, g \mapsto |g|^c, p \mapsto p/c, q \mapsto q/c, r \mapsto r/c to normalise one of the exponent, say r, to be equal to 1.  Finally there is a symmetry f \mapsto w^{-1/p} f, g \mapsto w^{-1/q} g, d\mu \mapsto w\ d\mu arising from reweighting the measure, which can be used (for instance) to achieve a normalisation f=g, which can be convenient for some proofs of this inequality (see this previous blog post for more discussion).

A more advanced move is to arbitrage an imbalance in symmetry in a statement to strengthen it, or to locate interesting asymptotic special cases; I discuss these sorts of tricks in this post.  This is particularly powerful with respect to the symmetry of taking tensor products, where it is known as the “tensor power trick”.  As another example, if one wishes to prove a nonlinear inequality of the form

F(f+g) \leq F(f)

for some given f and all g in some vector space, one can exploit the imbalance of symmetry with respect to dilations g \mapsto \varepsilon g to rewrite this as

F(f+\varepsilon g) \leq F(f)

and then by sending \varepsilon \to 0 one obtains some useful linearisations of the problem which can then be studied first (e.g., if F is differentiable, one has the sub-problem of checking that F'(f)=0, and if F is twice differentiable, there is also the sub-problem of checking that F''(f) is negative semidefinite).

Symmetries can also be used to detect errors (particularly with regards to exponents), much as dimensional analysis is used in physics; see also this related Tricki article.

Because symmetries are so useful, it is often worth abstracting or generalising the problem to make these symmetries more evident; thus this trick combines well with Trick 10.

20.  To show that a “completed” mathematical object exists, one can first build an “incomplete” mathematical object, find a way to make it more complete, and appeal to Zorn’s lemma to conclude.

This trick is so standard in analysis that one sometimes sees lemmas in the literature whose entire proof is basically “Apply Zorn’s lemma in the obvious fashion.”  The Hahn-Banach theorem is a well known example of this, but there are many others (e.g. the proof of existence of non-principal ultrafilters, discussed in this previous post).  See my blog post on Zorn’s lemma for more discussion.  One subtlety when applying this lemma is that one has to check that the class of incomplete objects one is applying Zorn’s lemma to is actually a set rather than a proper class; see this previous blog post for more discussion.

Zorn’s lemma of course relies on the axiom of choice, and the objects produced by this lemma are generally considered to be “nonconstructive”, but this should not deter one from using it; in practical applications (not involving pathological or hugely infinite objects), one can often rework an argument using Zorn’s lemma into a lengthier but more constructive one that does not require the axiom of choice.

21.  To demonstrate that all objects in a certain class obey a certain property, first show that this property is closed under various basic operations of the class, and then establish the property for “generating” elements of the class.

This is a variant of Trick 3.  It can be illustrated by some basic examples:

  1. (Induction)  To show that a property P(n) holds for all natural numbers n, first show that it is preserved under incrementation (if P(n) holds, then P(n+1) holds), then check the base case P(0) arising from the generator n=0.
  2. (Backwards induction)  To show that a property P(n) holds for all numbers 1 \leq n \leq N, first show that it is preserved under decrementation (if P(n) holds for 1 < n \leq N, then P(n-1) holds), then check the base case P(N) arising from the generator n=N.
  3. (Strong induction)  To show that a property P(n) holds for all natural numbers n, show that it is preserved under least strict upper bounds (if P(n) holds for all n<n_0, then P(n_0) holds).  In this case an empty set of generators suffices, so one can omit the base case step.  (This also works more generally if the natural numbers are replaced by a well-ordered set.)
  4. (Transfinite induction)  To show that a property P(\alpha) holds for all \alpha in a well-ordered set, show that it is preserved under least upper bounds and incrementation, and verify the base case.
  5. (Continuous induction) To show that a property P(x) holds for all x in a metric space, show that the property is preserved under limits (if x_n \to x and P(x_n) holds, then P(x) holds), is open (if P(x) holds, then P(y) holds for all y sufficiently close to x), and verify a base case P(x_0) for at least one point x_0 \in X.  (Continuous induction also works in more general topological spaces than metric spaces, though if the space is not first countable one may need to work with limits of nets instead of limits of sequences.)
  6. (Dense subclass argument)  To show that a property P(x) holds for all x in a topological space X, show that it is preserved under limits, then check it for a dense subclass.  (This argument is also known as the “density argument” or “limiting argument”.)  Again, for spaces that are not first countable, one has to work with limits of nets rather than limits of sequences.
  7. (Linear algebra generation)  To show that a property P(v) holds for all vectors v in a vector space V, show that the property is preserved under linear combinations (thus if P(v), P(w) hold, then P(v+w) holds, as well as P(cv) for any scalar v), and also verify P(v) for an algebraic basis (or Hamel basis) of V.  This sort of argument can also be adapted to other bases of V than Hamel bases, but then one often needs to also verify that the property P is preserved under limits.
  8. (Sigma algebra generation)  To show that a property P(E) holds for all E in a \sigma-algebra {\mathcal B}, show that the property is preserved under countable unions, countable intersections, and complements, and then verify P for the empty set and for a generating set of the \sigma-algebra (for instance, if {\mathcal B} is a Borel \sigma-algebra, one can take open sets, closed sets, or compact sets as the generating sets; in some cases one can also take balls, cubes, or rectangles); see Remark 4 of Notes 3.    One can relax the set of properties required, for instance countable intersections is redundant by de Morgan’s laws if one already has countable unions and complements, and by the monotone class lemma one only has to verify monotone countable intersections and unions (and drop complements) if the generating set is already a Boolean algebra.  (In probability theory, the \pi-\lambda theorem of Dynkin plays a similar role to the monotone class lemma.)
  9. (Convex generation)  To show that a property P(v) holds for all v in a finite-dimensional compact convex set K, show that it holds for convex linear combinations, then test it on all extreme points.  The same argument also works for infinite dimensions if one allows for convex integrals rather than convex combinations, thanks to Choquet’s theorem; alternatively, by the Krein-Milman theorem, one can keep working with finite convex combinations as long the property is closed under limits.
  10. (Stone-Weierstrass generation)  To show that a property P(f) holds for all continuous functions f on a compact space X, show that the property is closed under algebraic operations (linear operations, pointwise multiplication, and also complex conjugation, if one is working with complex-valued functions), and also is true for a class of functions that is large enough to separate points in the space, so that the Stone-Weierstrass theorem may be applied.
  11. (Symmetry reduction)  To show that a property P(x) holds for all x in a space X, which has a group G acting on it, it suffices to show that the property P is invariant under the group action (if P(x) is true, then P(gx) is true for any g \in G, and then verify P for one representative from each orbit of G.  (This reduction overlaps significantly with Trick 19.)

One caveat with applying these sorts of reductions: it is necessary that the predicate P() be precisely defined to avoid vagueness paradoxes.  In particular, the predicate P() should not involve asymptotic notation such as O() or o().