I’ve just come back from the 48th Annual IEEE Symposium on the Foundations of Computer Science, better known as FOCS; this year it was held in Providence, near Brown University. (This conference is also being officially reported on by the blog posts of Nicole Immorlica, Luca Trevisan, and Scott Aaronson.) I was there to give a tutorial on some of the tools used these days in additive combinatorics and graph theory to distinguish between structure and randomness. In a previous blog post, I had already mentioned that my lecture notes for this were available on the arXiv; now the slides for my tutorial are available too (they cover much the same ground as the lecture notes, and also incorporate some material from my ICM slides, but in a slightly different format).

In the slides, I am tentatively announcing some very recent (and not yet fully written up) work of Ben Green and myself establishing the Gowers inverse conjecture in finite fields in the special case when the function f is a bounded degree polynomial (this is a case which already has some theoretical computer science applications). I hope to expand upon this in a future post. But I will describe here a neat trick I learned at the conference (from the FOCS submission of Bogdanov and Viola) which uses majority voting to amplify a large number of small independent correlations into a single, much stronger correlation. This application of majority voting is widespread in computer science (and, of course, in real-world democracies), but I had not previously been aware of its usefulness for the type of structure/randomness problems I am interested in (in particular, it seems to significantly simplify some of the arguments in the proof of my result with Ben mentioned above); thanks to this conference, I now know to add majority voting to my “toolbox”.

Here is a simple instance of the majority voting trick in action. Suppose one is given a boolean function f: {\Bbb F}_2^n \to {\Bbb F}_2, where one should think of the number n of inputs as being very large. Here of course {\Bbb F}_2 = \{0,1\} is the field of two elements. Given any such function, and any shift h \in {\Bbb F}_2^n, one can form its derivative D_h f: {\Bbb F}_2^n \to {\Bbb F}_2 by the formula D_h f(x) := f(x+h) - f(x). (Actually, since {\Bbb F}_2 has characteristic 2, one could write f(x+h) + f(x) here instead of f(x+h) - f(x), but I will retain the subtraction in order to emphasise the analogy with the classical concept of differentiation.) Now, it turns out that if a function exhibits “polynomial behaviour”, then its derivatives are simpler than the function itself. For instance:

  1. If f is a constant function, then D_h f is zero for all h.
  2. If f is a linear functional, then D_h f is constant for all h.
  3. If f is a quadratic form, then D_h f is an (affine-)linear functional for all h.
  4. More generally, if f is a polynomial in n variables over {\Bbb F}_2 of degree d, then for every h, D_h f is a polynomial of degree at most d-1. (This is of course totally consistent with one’s experience with differentiation of polynomials in the classical setting; a small computational check of this degree-drop phenomenon is sketched below.)
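Here is that check, in Python; it is purely illustrative, and the specific degree 3 polynomial, and the computation of degrees via the algebraic normal form, are just choices made for this demonstration rather than anything needed in the sequel.

    # A computational check of the degree-drop property: for a polynomial f over F_2
    # of degree d, every derivative D_h f(x) = f(x+h) - f(x) has degree at most d-1.
    # The degree is read off from the algebraic normal form (ANF), computed by the
    # standard Moebius transform over subsets of the coordinates.
    from itertools import product

    n = 4

    def anf_degree(f):
        """Degree of the algebraic normal form of a boolean function on F_2^n."""
        degree = 0
        for x in product((0, 1), repeat=n):
            # ANF coefficient of the monomial supported on supp(x):
            # the XOR of f over all points y with y <= x coordinatewise.
            c = 0
            for y in product((0, 1), repeat=n):
                if all(yi <= xi for yi, xi in zip(y, x)):
                    c ^= f(y)
            if c:
                degree = max(degree, sum(x))
        return degree

    # a degree 3 example: f(x) = x_0 x_1 x_2 + x_1 x_3 + x_0 (arithmetic over F_2)
    f = lambda x: (x[0] & x[1] & x[2]) ^ (x[1] & x[3]) ^ x[0]

    print("deg f =", anf_degree(f))                              # prints 3
    for h in [(1, 0, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1)]:
        # D_h f(x) = f(x+h) + f(x); addition of shifts in F_2^n is coordinatewise XOR
        D_h_f = lambda x, h=h: f(tuple(xi ^ hi for xi, hi in zip(x, h))) ^ f(x)
        print("deg D_h f =", anf_degree(D_h_f), "for h =", h)    # each at most 2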

Because of this reduction-of-complexity phenomenon, it becomes desirable to represent a function f(x) “efficiently” in terms of its derivatives D_h f(x) if at all possible. What do I mean by “efficient”? Well, one has the rather silly representation formula

f(x) = f(0) - D_{-x} f(x)

which allows us to reconstruct the value of f(x) for any x given the value f(0) of f at the origin, together with the value at x of the single derivative D_{-x} f. But this is a very “inefficient” representation, because the relevant shift -x varies with x, and so it requires us to record the derivatives D_h f for all h. In applications, we want some representation of f that requires only knowing D_h f for a few values of h.
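Incidentally, the silly formula above is nothing more than the definition of the derivative in disguise:

D_{-x} f(x) = f(x + (-x)) - f(x) = f(0) - f(x),

so that subtracting this derivative from f(0) does indeed recover f(x).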

In the case when f is a random function, with all the values f(x) of f being independent and uniformly distributed in \{0,1\}, such an efficient representation is basically not possible, for the intuitively obvious reason that none of the values of D_h f(x) are particularly well correlated with each other, or with f(x). But if f is not random, one can hope to do better. As an example of this, let me mention the following result (a simple instance of a lemma of Bogdanov and Viola):

Lemma. Let f: {\Bbb F}_2^n \to {\Bbb F}_2 be a boolean function, and let 0 < \varepsilon < 1. Then at least one of the following statements is true:

  1. (Pseudorandomness) f is unbiased up to error O(\varepsilon), in the sense that f(x) takes the value 1 for a fraction 1/2 + O(\varepsilon) of the possible inputs x (and similarly for the value 0, of course); or
  2. (Structure) f can be well approximated by a few of its derivatives. More precisely, there exist h_1, \ldots, h_k \in {\Bbb F}_2^n for some k = O_\varepsilon(1) and some boolean function B: {\Bbb F}_2^k \to {\Bbb F}_2 such that f(x) = B( D_{h_1} f(x), \ldots, D_{h_k} f(x) ) for at least a fraction 1-O(\varepsilon) of the inputs x.

This result seems to be rather difficult to prove unless one knows about the majority voting trick, in which case it is rather easy. It works like this. Suppose that f is not pseudorandom; let’s say for the sake of argument that it takes the value 1 with probability at least 1/2 + \varepsilon, and the value 0 with probability at most 1/2 - \varepsilon. The key point is then that the functions f and f-1 have significantly different distributions (or histograms); indeed, f-1 takes the value 1 with probability at most 1/2-\varepsilon, and the value 0 with probability at least 1/2 + \varepsilon. So there is a gap of at least 2\varepsilon between the distributions of f and f-1, with the former favoring 1 and the latter favoring 0.

How can we leverage such a tiny gap? Well, we begin with the tautological identity

D_h f(x)  = f(x+h)-f(x).

We apply this identity with x fixed, and h selected uniformly at random; thus x+h is also a uniformly distributed random variable. As a consequence, the random variable D_h f(x) has either the distribution of f (if f(x)=0) or that of f-1 (if f(x)=1). If we can tell which of these two distributions D_h f(x) enjoys, we have discovered what f(x) is.

So now, the procedure is obvious to anyone who knows about statistics: take an empirical distribution! In other words, just pick some shifts h_1,\ldots,h_k independently and uniformly at random for some large k (which we will take to be odd, in order to avoid ties), and measure the k values D_{h_1} f(x), \ldots, D_{h_k} f(x) \in {\Bbb F}_2. Then perform a majority vote, i.e. ask whether more of the D_{h_i} f(x) are equal to 1, or more are equal to 0. If the majority vote is 1, then it is more likely that D_h f(x) had the distribution of f and hence that f(x)=0; conversely, if the majority vote is 0, then it is more likely that D_h f(x) had the distribution of f-1 and hence that f(x)=1. Thus we see that the majority vote of these k derivatives gives a prediction for the value of f(x). At this point it is not difficult to use basic laws of probability (in particular, the law of large numbers) to show that this prediction becomes increasingly accurate as k increases, and the lemma can then be proven rather quickly (and one can take k to be something civilised like \varepsilon^{-2} \log \frac{1}{\varepsilon}).
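To see the voting trick in action, here is a small simulation in Python; it is purely illustrative, and the encoding of {\Bbb F}_2^n as integers, the use of a random biased function, and the specific parameters are all just choices made for the demonstration, rather than anything taken from the Bogdanov-Viola argument. The function f below takes the value 1 with probability roughly 1/2 + \varepsilon, and each value f(x) is then predicted purely from the votes D_{h_1} f(x), \ldots, D_{h_k} f(x) at a fixed set of random shifts; the agreement with f should comfortably exceed 1 - \varepsilon.

    # Majority-vote reconstruction of a biased boolean function from its derivatives.
    import random

    n, eps, k = 12, 0.1, 101   # k odd; the argument suggests k of size about eps^{-2} log(1/eps)
    random.seed(0)

    # a random function biased towards 1: each value equals 1 with probability 1/2 + eps
    table = [1 if random.random() < 0.5 + eps else 0 for _ in range(2 ** n)]
    f = lambda x: table[x]     # points of F_2^n are encoded as integers 0, ..., 2^n - 1

    shifts = [random.randrange(2 ** n) for _ in range(k)]    # the shifts h_1, ..., h_k

    def predict(x):
        # D_h f(x) = f(x+h) - f(x); addition in F_2^n is bitwise XOR of the encodings
        votes = [f(x ^ h) ^ f(x) for h in shifts]
        # a majority of 1s suggests that D_h f(x) is distributed like f, i.e. f(x) = 0
        return 0 if sum(votes) > k // 2 else 1

    agreement = sum(predict(x) == f(x) for x in range(2 ** n)) / 2 ** n
    print(f"prediction agrees with f on {agreement:.1%} of the inputs")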

The point here was that each of the functions D_{h_i} f had a small correlation (of size roughly \varepsilon) with f. But, because these functions were all independent of each other (due to the independent random choices of the h_i), one could use majority vote to amplify these small correlations into a much stronger correlation.
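For the reader who wants the quantitative version of this amplification, here is a back-of-the-envelope accounting (essentially the law of large numbers step alluded to earlier). For each fixed x, each individual vote D_{h_i} f(x) points to the correct value of f(x) with probability at least 1/2 + \varepsilon, and the votes are independent, so by the Hoeffding inequality the majority of the k votes errs with probability at most

\exp( - 2 \varepsilon^2 k ),

which is O(\varepsilon) once k is comparable to \varepsilon^{-2} \log \frac{1}{\varepsilon}. Averaging this error over x, and then fixing a choice of h_1,\ldots,h_k which does at least as well as the average, gives the structured alternative of the lemma, with B being the (negated) majority function on k bits.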

[On a somewhat ironic note, during my tutorial I specifically commented that majority voting seemed not to be a useful tool for dealing with small correlations, since "if only 1% of the population is voting the right way, majority voting doesn't always yield the right outcome". But what I realise now is that if 1% of the population votes the right way, and the other 99% is behaving in a sufficiently random manner, then the influence of the 1% "voting bloc" can be decisive, allowing majority voting to be a useful tool for amplifying a correlation.]