I’ve just come back from the 48th Annual IEEE Symposium on the Foundations of Computer science, better known as FOCS; this year it was held at Providence, near Brown University. (This conference is also being officially reported on by the blog posts of Nicole Immorlica, Luca Trevisan, and Scott Aaronson.) I was there to give a tutorial on some of the tools used these days in additive combinatorics and graph theory to distinguish structure and randomness. In a previous blog post, I had already mentioned that my lecture notes for this were available on the arXiv; now the slides for my tutorial are available too (it covers much the same ground as the lecture notes, and also incorporates some material from my ICM slides, but in a slightly different format).
In the slides, I am tentatively announcing some very recent (and not yet fully written up) work of Ben Green and myself establishing the Gowers inverse conjecture in finite fields in the special case when the function f is a bounded degree polynomial (this is a case which already has some theoretical computer science applications). I hope to expand upon this in a future post. But I will describe here a neat trick I learned at the conference (from the FOCS submission of Bogdanov and Viola) which uses majority voting to enhance a large number of small independent correlations into a much stronger single correlation. This application of majority voting is widespread in computer science (and, of course, in real-world democracies), but I had not previously been aware of its utility to the type of structure/randomness problems I am interested in (in particular, it seems to significantly simplify some of the arguments in the proof of my result with Ben mentioned above); thanks to this conference, I now know to add majority voting to my “toolbox”.
Here is a simple instance of the majority voting trick in action. Suppose one is given a boolean function , where one should think of the number
of inputs as being very large. Here of course
is the field of two elements. Given any such function, and any shift
, one can form its derivative
by the formula
. (Actually, since
has characteristic 2, one could write f(x+h) + f(x) here instead of f(x+h) – f(x), but I will retain the subtraction in order to emphasise the analogy with the classical concept of differentiation.) Now, it turns out that if a function exhibits “polynomial behaviour”, than the derivatives of a function are simpler than the function itself. For instance:
- If f is a constant function, then
is zero for all h.
- If f is a linear functional, then
is constant for all h.
- If f is a quadratic form, then
is an (affine-)linear functional for all h.
- More generally, if f is a polynomial in n variables over
of degree d, then for every h,
is a polynomial of degree d-1. (This is of course totally consistent with one’s experience with differentiation of polynomials in the classical setting.)
Because of this reduction-of-complexity phenomenon, it becomes desirable to represent a function f(x) “efficiently” in terms of its derivatives if at all possible. What do I mean by “efficient”? Well, one has the rather silly representation formula
which allows us to reconstruct the value of f(x) for any x given the value f(0) at 0, and the value of for some h. But this is a very “inefficient” representation, because it will require us to record the derivatives
for all h. In applications, we want some representation of f that requires only knowing
for a few values of h.
In the case when f is a random function, with all the values of f being independent and uniformly distributed between 0 and 1, this is basically not possible, for the intuitively obvious reason that none of the values of
are particularly well correlated to each other, or to f(x). But if f is not random, one can hope to do better. As an example of this, let me mention the following result (a simple instance of a lemma of Bogdanov and Viola):
Lemma. Let
be a boolean function, and let
. Then at least one of the following statements is true:
- (Pseudorandomness) f is unbiased up to error
, in the sense that f(x) takes the value 1 for
of the possible inputs x (and similarly for the value 0, of course); or
- (Structure) f can be well approximated by a few of its derivatives. More precisely, there exists
for some
and some boolean function
such that
for at least
of the inputs x.
This result seems to be rather difficult to prove unless one knows about the majority voting trick, in which case it is rather easy. It works like this. Suppose that f is not pseudorandom; let’s say for sake of argument that it takes the value 1 with probability at least , and the value 0 with probability at most
. The key point is then that the functions f and f-1 have significantly different distributions (or histograms); indeed, f-1 takes 1 with probability at most
, and 0 with probability at least
. So there is a gap of at least
between the distributions of f and f-1, with the former favoring 1 and the latter favoring 0.
How can we leverage such a tiny gap? Well, we begin with the tautological identity
.
We apply this identity with x fixed, and h selected uniformly at random; thus x+h is also a uniformly distributed random variable. As a consequence, the random variable either has the distribution of f (if f(x)=0) or f-1 (if f(x)=1). If we can tell which of these two distributions
enjoys, we have discovered what f(x) is.
So now, the procedure is obvious to anyone who knows about statistics: take an empirical distribution! In other words, just pick some shifts at random for some large k (which we will take to be odd, in order to avoid ties), and measure the k values
. Then perform a majority vote, i.e. ask whether more of the
are equal to 1, or more are equal to 0. If the majority vote is 1, then it is more likely that
had the distribution of f and f(x)=0; conversely, if the majority vote is 0, then it is more likely that
had the distribution of f-1 and f(x)=1. Thus we see that the majority vote of these k derivatives gives a prediction for the value of f. At this point it is not difficult to use basic laws of probability (in particular, the law of large numbers) to show that this prediction becomes increasingly accurate as k increases, and the lemma can then be proven rather quickly (and one can take k to be something civilised like
).
The point here was that each of the functions had a small correlation (about
) with f. But, because these functions were all independent of each other (due to the independent random choices of the
), one could use majority vote to amplify these small correlations into a much stronger correlation.
[On a somewhat ironic note, during my tutorial I specifically commented that majority vote seemed to not be a useful tool for dealing with small correlations, since "if only 1% of the population is voting the right way, majority voting don't always yield the right outcome". But what I realise now is that if 1% of the population votes the right way, and the other 99% is behaving in a sufficiently random manner, then the influence of the 1% "voting bloc" can be decisive, allowing majority vote to be a useful tool for amplifying a correlation.]

12 comments
Comments feed for this article
23 October, 2007 at 8:20 pm
John
Thank you for the clear explanation of the lemma. Yes, amusing irony.
24 October, 2007 at 4:19 am
van vu
Dear Terry,
It sounds like a general principle used frequently in probabilistic combinatorics: if two random variables has separated means, then few
samples will tell them appart. Best, Van.
24 October, 2007 at 10:24 am
Anonymous
An historical question regarding your slides. Can you trace back the idea of decomposing an object into (pseudo) random part and structural part. Is this an older idea. Is the specific version for functions has some specific features that goes beyond, or is it just another formulation?
24 October, 2007 at 10:43 am
Terence Tao
That’s a good question. Certainly there are many classical results (e.g. orthonormal projections onto a subspace of a Hilbert space; spectral decomposition into continuous and singular components of a self-adjoint operator; Koopman-von Neumann decomposition of a shift operator into almost periodic and weakly mixing components) which have this flavour, as well as many decompositions used in signal processing (e.g. thresholding).
The method of real interpolation, used for instance to prove the Marcinkeiwicz interpolation theorem, also is in the same spirit; more generally, there are several “stopping time” algorithms in harmonic analysis which have a little bit of resemblance to the energy-increment algorithms discussed here, though the Szemeredi regularity lemma argument is a much closer fit (and was a direct inspiration for the formulations of these theorems as presented in my notes).
There are also some structural theorems for sequences associated to martingales (I don’t recall the exact details right now, though). In number theory, we have the decomposition of the circle in to major and minor arcs, which is particularly important in the Hardy-Littlewood circle method. In linear PDE, such as in the study of the Schrodinger equation, there is the decomposition of states into bound states and scattering states, which is also falls somewhat in this framework.
25 October, 2007 at 1:43 am
Emmanuel Kowalski
A pretty early form of “structure + randomness”, though in a somewhat distinct flavor, is Riemann’s explicit formula for the prime counting function (proved by von Mangoldt): the structure term is just the logarithmic integral $$li(x)=\int_2^x{dt/\log t}$$, showing the $ 1/log(t)$ density predicted by Gauss, and the (hopefully) random part is a sum over zeros of the Riemann zeta function. (Though it doesn’t really converge if written this way; other variants such as that for the sum of $\log p$ for $p<x$ are cleaner).
25 October, 2007 at 2:00 pm
Doug
I find it exciting to learn that there is an interchange of ideas among the spectrum of mathematics [at least potential Fields, Nevalinna and Gauss awardees].
There is hope yet that mathematics may be unified.
I am not surprised that your description of Probability Space, although not identical, is very similar to that found in Appendix B: Some Notions of Probability Theory, Tamer Basar and Geert Jan Olsder, Dynamic Noncooperative Game Theory (Classics in Applied Mathematics) to supplement their discussion on stochastic game theory.
26 October, 2007 at 3:59 am
Prashant V
Dear Terry,
You say that one of the main goals of the majority voting trick is to enhance a large number of small independent correlations into a much stronger single correlation. So is this essentially the same thing as “filtering out noise?”
Do you think that this trick would make it easier in dealing with chaotic systems (by reducing noise)? How does noise reduction using the majority voting trick compare to noise reduction via. averaging/weighted averaging?
26 October, 2007 at 5:53 am
Terence Tao
Dear Prashant,
One can indeed view majority vote as a noise-filtering algorithm, albeit for discrete (boolean) signals rather than continuous (analog) signals. It is also nonlinear, as opposed to averaging which is usually a linear operation. But the purposes of the two methods are still broadly similar.
26 October, 2007 at 5:54 am
Kay
“But what I realise now is that if 1% of the population votes the right way, and the other 99% is behaving in a sufficiently random manner, then the influence of the 1% “voting bloc” can be decisive, allowing majority vote to be a useful tool for amplifying a correlation.”
Dear Terry,
As I understand it, this is actually a very common argument used by economists and political scientists for why democracy isn’t quite the same as ‘mob rule’ … in the sense that, people who don’t know about the issue would vote randomly and people who know about the issue will vote in a more patterned way. There are systematic biases in how people vote that is introduced by campaign advertising and culture so this does not all stand up in practical situations but it is a mitigating factor.
7 November, 2007 at 8:52 am
An update on the inverse conjecture for the Gowers norm over finite fields « What’s new
[...] finite fields, Gowers norm, inverse conjecture, polynomials, Ramsey theory, symmetric polynomials Recently, I had tentatively announced a forthcoming result with Ben Green establishing the “Gowers inverse conjecture” (or [...]
23 November, 2007 at 1:50 pm
The distribution of polynomials over finite fields, with applications to the Gowers norms « What’s new
[...] to the arXiv, and submitted to Contributions to Discrete Mathematics. This paper, which we first announced at the recent FOCS meeting, and then gave an update on two weeks ago on this blog, is now in final [...]
30 January, 2008 at 2:52 pm
254A, Lecture 8: The mean ergodic theorem « What’s new
[...] respectively. For more on decompositions into structured and random components, see my FOCS lecture [...]