
Let {S} be a non-empty finite set. If {X} is a random variable taking values in {S}, the Shannon entropy {H[X]} of {X} is defined as

\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let {f: S \rightarrow {\bf R}} be a function. Then

\displaystyle  \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X]. \ \ \ \ \ (1)

Proof: Note that shifting {f} by a constant affects both sides of (1) the same way, so we may normalize {\sum_{s \in S} \exp(f(s)) = 1}. Then {\exp(f(s))} is now the probability distribution of some random variable {Y}, and the claim can be rewritten as

\displaystyle  0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as {-D_{KL}(X||Y)}, where {D_{KL}} denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions {x \mapsto e^x} and {y \mapsto y \log y - y}.) \Box
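As a quick numerical sanity check of Lemma 1, one can confirm that the supremum in (1) is attained at the Gibbs distribution {{\bf P}[X=s] = \exp(f(s))/\sum_{s'} \exp(f(s'))}, and that other distributions fall short of the left-hand side. The following is a throwaway Python sketch; the six-element set and the random test function are arbitrary choices.

    import math, random

    random.seed(0)

    def entropy(p):
        # Shannon entropy of a probability vector (with the convention 0 log 0 = 0)
        return -sum(q * math.log(q) for q in p if q > 0)

    f = [random.uniform(-2.0, 2.0) for _ in range(6)]   # arbitrary test function on a 6-element S
    lhs = math.log(sum(math.exp(x) for x in f))

    # the supremum in (1) is attained at the Gibbs distribution p(s) proportional to exp(f(s))
    Z = sum(math.exp(x) for x in f)
    gibbs = [math.exp(x) / Z for x in f]
    rhs = sum(p * x for p, x in zip(gibbs, f)) + entropy(gibbs)

    uniform = [1.0 / 6.0] * 6   # any other distribution should give a smaller value
    sub = sum(p * x for p, x in zip(uniform, f)) + entropy(uniform)

    print(abs(lhs - rhs) < 1e-12)   # True: the supremum is attained
    print(sub <= lhs + 1e-12)       # True: this is the Gibbs inequality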

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let {n \geq 0}, let {S, T_1,\dots,T_n} be finite non-empty sets, and let {\pi_i: S \rightarrow T_i} be functions for each {i=1,\dots,n}. Let {K: S \rightarrow {\bf R}^+} and {f_i: T_i \rightarrow {\bf R}^+} be positive functions for each {i=1,\dots,n}. Then

\displaystyle  \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}

where {Q} is the quantity

\displaystyle  Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}

where {\Omega_n} is the set of all tuples {(s_0,\dots,s_n) \in S^{n+1}} such that {\pi_i(s_{i-1}) = \pi_i(s_i)} for {i=1,\dots,n}.

Thus for instance, the inequality is a trivial identity for {n=0}. When {n=1}, the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}

\displaystyle  ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},

which is easily proven by Cauchy-Schwarz, while for {n=2} the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))

\displaystyle  \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}

\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}

which can also be proven by elementary means. However even for {n=3}, the existing proofs require the “tensor power trick” in order to reduce to the case when the {f_i} are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).
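Before turning to the proof, here is a rough Python sketch that tests the {n=2} case of Theorem 2 on random data; the sets, projections and weights below are arbitrary test choices rather than anything from Carbery’s paper.

    import random

    random.seed(0)
    S, T1, T2 = range(8), range(3), range(4)
    pi1 = {s: random.randrange(3) for s in S}        # arbitrary maps pi_i: S -> T_i
    pi2 = {s: random.randrange(4) for s in S}
    K = {s: random.uniform(0.1, 2.0) for s in S}     # arbitrary positive weights
    f1 = {t: random.uniform(0.1, 2.0) for t in T1}
    f2 = {t: random.uniform(0.1, 2.0) for t in T2}

    lhs = sum(K[s] * f1[pi1[s]] * f2[pi2[s]] for s in S)

    # Q^3 = sum over (s0,s1,s2) in Omega_2, i.e. pi1(s0)=pi1(s1) and pi2(s1)=pi2(s2)
    Q = sum(K[s0] * K[s1] * K[s2]
            for s0 in S for s1 in S for s2 in S
            if pi1[s0] == pi1[s1] and pi2[s1] == pi2[s2]) ** (1 / 3)

    rhs = Q * sum(f1[t] ** 3 for t in T1) ** (1 / 3) * sum(f2[t] ** 3 for t in T2) ** (1 / 3)
    print(lhs <= rhs + 1e-9)   # True on every random trial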

We now prove this inequality. We write {K(s) = \exp(k(s))} and {f_i(t_i) = \exp(g_i(t_i))} for some functions {k: S \rightarrow {\bf R}} and {g_i: T_i \rightarrow {\bf R}}. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

\displaystyle  \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]

\displaystyle  \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]

\displaystyle  + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]

where {X} ranges over random variables taking values in {S}, {(X_0,\dots,X_n)} ranges over tuples of random variables taking values in {\Omega_n}, and {Y_i} ranges over random variables taking values in {T_i}. Comparing the suprema, the claim now reduces to

Lemma 3 (Conditional expectation computation) Let {X} be an {S}-valued random variable. Then there exists an {\Omega_n}-valued random variable {(X_0,\dots,X_n)} such that each {X_i} has the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]

\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].

Proof: We induct on {n}. When {n=0} we just take {X_0 = X}. Now suppose that {n \geq 1}, and the claim has already been proven for {n-1}, thus one has already obtained a tuple {(X_0,\dots,X_{n-1}) \in \Omega_{n-1}} with each {X_0,\dots,X_{n-1}} having the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].

By hypothesis, {\pi_n(X_{n-1})} has the same distribution as {\pi_n(X)}. For each value {t_n} attained by {\pi_n(X)}, we can take conditionally independent copies of {(X_0,\dots,X_{n-1})} and {X} conditioned on the events {\pi_n(X_{n-1}) = t_n} and {\pi_n(X) = t_n} respectively, and then concatenate them to form a tuple {(X_0,\dots,X_n)} in {\Omega_n}, with {X_n} a further copy of {X} that is conditionally independent of {(X_0,\dots,X_{n-1})} relative to {\pi_n(X_{n-1}) = \pi_n(X)}. One can then use the entropy chain rule to compute

\displaystyle  {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])

\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])

\displaystyle  ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]

and the claim now follows from the induction hypothesis. \Box

With a little more effort, one can replace {S} by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.
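Incidentally, the {n=1} case of Lemma 3 is easy to check by direct computation: the coupling constructed in the proof is just a pair {(X_0,X_1)} of conditionally independent copies of {X} given {\pi_1(X_0) = \pi_1(X_1)}. Here is a small Python sketch confirming the entropy identity, with an arbitrary test distribution and projection.

    import math, random
    from collections import defaultdict

    random.seed(0)
    S = range(6)
    pi = {s: s % 3 for s in S}                 # arbitrary projection pi_1: S -> T_1
    w = [random.random() for _ in S]
    p = {s: w[s] / sum(w) for s in S}          # arbitrary distribution of X

    q = defaultdict(float)                     # distribution of pi_1(X)
    for s in S:
        q[pi[s]] += p[s]

    def H(dist):
        return -sum(v * math.log(v) for v in dist.values() if v > 0)

    # conditionally independent copies X_0, X_1 of X given pi_1(X_0) = pi_1(X_1)
    joint = {(s0, s1): p[s0] * p[s1] / q[pi[s0]]
             for s0 in S for s1 in S if pi[s0] == pi[s1]}

    print(abs(H(joint) - (2 * H(p) - H(q))) < 1e-12)   # True: matches Lemma 3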

Given two unit vectors {v,w} in a real inner product space, one can define the correlation between these vectors to be their inner product {\langle v, w \rangle}, or in more geometric terms, the cosine of the angle {\angle(v,w)} subtended by {v} and {w}. By the Cauchy-Schwarz inequality, this is a quantity between {-1} and {+1}, with the extreme positive correlation {+1} occurring when {v,w} are identical, the extreme negative correlation {-1} occurring when {v,w} are diametrically opposite, and the zero correlation {0} occurring when {v,w} are orthogonal. This notion is closely related to the notion of correlation between two non-constant square-integrable real-valued random variables {X,Y}, which is the same as the correlation between two unit vectors {v,w} lying in the Hilbert space {L^2(\Omega)} of square-integrable random variables, with {v} being the normalisation of {X} defined by subtracting off the mean {\mathbf{E} X} and then dividing by the standard deviation of {X}, and similarly for {w} and {Y}.

One can also define correlation for complex (Hermitian) inner product spaces by taking the real part {\hbox{Re} \langle , \rangle} of the complex inner product to recover a real inner product.

While reading the (highly recommended) recent popular maths book “How not to be wrong”, by my friend and co-author Jordan Ellenberg, I came across the (important) point that correlation is not necessarily transitive: if {X} correlates with {Y}, and {Y} correlates with {Z}, then this does not imply that {X} correlates with {Z}. A simple geometric example is provided by the three unit vectors

\displaystyle  u := (1,0); v := (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}); w := (0,1)

in the Euclidean plane {{\bf R}^2}: {u} and {v} have a positive correlation of {\frac{1}{\sqrt{2}}}, as do {v} and {w}, but {u} and {w} are not correlated with each other. Or: for a typical undergraduate course, it is generally true that good exam scores are correlated with a deep understanding of the course material, and memorising from flash cards is correlated with good exam scores, but this does not imply that memorising flash cards is correlated with deep understanding of the course material.
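(Here is a throwaway Python computation confirming the three correlations in this example:)

    import math

    def corr(a, b):
        # cosine of the angle between two vectors
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    u = (1.0, 0.0)
    v = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
    w = (0.0, 1.0)
    print(corr(u, v), corr(v, w), corr(u, w))   # 0.707..., 0.707..., 0.0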

However, there are at least two situations in which some partial version of transitivity of correlation can be recovered. The first is in the “99%” regime in which the correlations are very close to {1}: if {u,v,w} are unit vectors such that {u} is very highly correlated with {v}, and {v} is very highly correlated with {w}, then this does imply that {u} is very highly correlated with {w}. Indeed, from the identity

\displaystyle  \| u-v \| = 2^{1/2} (1 - \langle u,v\rangle)^{1/2}

(and similarly for {v-w} and {u-w}) and the triangle inequality

\displaystyle  \|u-w\| \leq \|u-v\| + \|v-w\|,

we see that

\displaystyle  (1 - \langle u,w \rangle)^{1/2} \leq (1 - \langle u,v\rangle)^{1/2} + (1 - \langle v,w\rangle)^{1/2}. \ \ \ \ \ (1)

Thus, for instance, if {\langle u, v \rangle \geq 1-\varepsilon} and {\langle v,w \rangle \geq 1-\varepsilon}, then {\langle u,w \rangle \geq 1-4\varepsilon}. This is of course closely related to (though slightly weaker than) the triangle inequality for angles:

\displaystyle  \angle(u,w) \leq \angle(u,v) + \angle(v,w).

Remark 1 (Thanks to Andrew Granville for conversations leading to this observation.) The inequality (1) also holds for sub-unit vectors, i.e. vectors {u,v,w} with {\|u\|, \|v\|, \|w\| \leq 1}. This comes by extending {u,v,w} in directions orthogonal to all three original vectors and to each other in order to make them unit vectors, enlarging the ambient Hilbert space {H} if necessary. More concretely, one can apply (1) to the unit vectors

\displaystyle  (u, \sqrt{1-\|u\|^2}, 0, 0), (v, 0, \sqrt{1-\|v\|^2}, 0), (w, 0, 0, \sqrt{1-\|w\|^2})

in {H \times {\bf R}^3}.
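One can also test the sub-unit case of (1) numerically; the following Python sketch draws random vectors of length less than one (the dimension is an arbitrary test parameter) and checks the inequality directly.

    import random

    random.seed(0)
    d = 5   # arbitrary dimension

    def rand_subunit(d):
        x = [random.gauss(0.0, 1.0) for _ in range(d)]
        r = sum(t * t for t in x) ** 0.5
        scale = random.random() / r   # rescale to a random length in (0,1)
        return [t * scale for t in x]

    def ip(a, b):
        return sum(x * y for x, y in zip(a, b))

    u, v, w = (rand_subunit(d) for _ in range(3))
    lhs = (1 - ip(u, w)) ** 0.5
    rhs = (1 - ip(u, v)) ** 0.5 + (1 - ip(v, w)) ** 0.5
    print(lhs <= rhs + 1e-9)   # True: (1) holds for sub-unit vectors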

But even in the “{1\%}” regime in which correlations are very weak, there is still a version of transitivity of correlation, known as the van der Corput lemma, which basically asserts that if a unit vector {v} is correlated with many unit vectors {u_1,\dots,u_n}, then many of the pairs {u_i,u_j} will be correlated with each other. Indeed, from the Cauchy-Schwarz inequality

\displaystyle  |\langle v, \sum_{i=1}^n u_i \rangle|^2 \leq \|v\|^2 \| \sum_{i=1}^n u_i \|^2

we see that

\displaystyle  (\sum_{i=1}^n \langle v, u_i \rangle)^2 \leq \sum_{1 \leq i,j \leq n} \langle u_i, u_j \rangle. \ \ \ \ \ (2)

Thus, for instance, if {\langle v, u_i \rangle \geq \varepsilon} for at least {\varepsilon n} values of {i=1,\dots,n}, then (after removing those indices {i} for which {\langle v, u_i \rangle < \varepsilon}) {\sum_{i,j} \langle u_i, u_j \rangle} must be at least {\varepsilon^4 n^2}, which implies that {\langle u_i, u_j \rangle \geq \varepsilon^4/2} for at least {\varepsilon^4 n^2/2} pairs {(i,j)}. Or as another example: if a random variable {X} exhibits at least {1\%} positive correlation with {n} other random variables {Y_1,\dots,Y_n}, then if {n > 10,000}, at least two distinct {Y_i,Y_j} must have positive correlation with each other (although this argument does not tell you which pair {Y_i,Y_j} are so correlated). Thus one can view this inequality as a sort of “pigeonhole principle” for correlation.
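The inequality (2) is equally easy to probe numerically, since its right-hand side is just {\| \sum_{i=1}^n u_i \|^2}. A Python sketch with random unit vectors follows; the dimension and the number of vectors are arbitrary test parameters.

    import random

    random.seed(0)
    dim, n = 50, 200   # arbitrary test parameters

    def rand_unit(d):
        x = [random.gauss(0.0, 1.0) for _ in range(d)]
        r = sum(t * t for t in x) ** 0.5
        return [t / r for t in x]

    def ip(a, b):
        return sum(x * y for x, y in zip(a, b))

    v = rand_unit(dim)
    us = [rand_unit(dim) for _ in range(n)]

    lhs = sum(ip(v, u) for u in us) ** 2
    rhs = sum(ip(ui, uj) for ui in us for uj in us)   # equals the squared norm of sum_i u_i
    print(lhs <= rhs + 1e-9)   # True: inequality (2)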

A similar argument (multiplying each {u_i} by an appropriate sign {\pm 1}) shows the related van der Corput inequality

\displaystyle  (\sum_{i=1}^n |\langle v, u_i \rangle|)^2 \leq \sum_{1 \leq i,j \leq n} |\langle u_i, u_j \rangle|, \ \ \ \ \ (3)

and this inequality is also true for complex inner product spaces. (Also, the {u_i} do not need to be unit vectors for this inequality to hold.)

Geometrically, the picture is this: if {v} positively correlates with all of the {u_1,\dots,u_n}, then the {u_1,\dots,u_n} are all squashed into a somewhat narrow cone centred at {v}. The cone is still wide enough to allow a few pairs {u_i, u_j} to be orthogonal (or even negatively correlated) with each other, but (when {n} is large enough) it is not wide enough to allow all of the {u_i,u_j} to be so widely separated. Remarkably, the bound here does not depend on the dimension of the ambient inner product space; while increasing the number of dimensions should in principle add more “room” to the cone, this effect is counteracted by the fact that in high dimensions, almost all pairs of vectors are close to orthogonal, and the exceptional pairs that are even weakly correlated to each other become exponentially rare. (See this previous blog post for some related discussion; in particular, Lemma 2 from that post is closely related to the van der Corput inequality presented here.)

A particularly common special case of the van der Corput inequality arises when {v} is a unit vector fixed by some unitary operator {T}, and the {u_i} are shifts {u_i = T^i u} of a single unit vector {u}. In this case, the inner products {\langle v, u_i \rangle} are all equal, and we arrive at the useful van der Corput inequality

\displaystyle  |\langle v, u \rangle|^2 \leq \frac{1}{n^2} \sum_{1 \leq i,j \leq n} |\langle T^i u, T^j u \rangle|. \ \ \ \ \ (4)

(In fact, one can even remove the absolute values from the right-hand side, by using (2) instead of (3).) Thus, to show that {v} has negligible correlation with {u}, it suffices to show that the shifts of {u} have negligible correlation with each other.

Here is a basic application of the van der Corput inequality:

Proposition 2 (Weyl equidistribution estimate) Let {P: {\bf Z} \rightarrow {\bf R}/{\bf Z}} be a polynomial with at least one non-constant coefficient irrational. Then one has

\displaystyle  \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N e( P(n) ) = 0,

where {e(x) := e^{2\pi i x}}.

Note that this assertion implies the more general assertion

\displaystyle  \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N e( kP(n) ) = 0

for any non-zero integer {k} (simply by replacing {P} by {kP}), which by the Weyl equidistribution criterion is equivalent to the sequence {P(1), P(2),\dots} being asymptotically equidistributed in {{\bf R}/{\bf Z}}.

Proof: We induct on the degree {d} of the polynomial {P}, which must be at least one. If {d} is equal to one, the claim is easily established from the geometric series formula, so suppose that {d>1} and that the claim has already been proven for {d-1}. If the top coefficient {a_d} of {P(n) = a_d n^d + \dots + a_0} is rational, say {a_d = \frac{p}{q}}, then by partitioning the natural numbers into residue classes modulo {q}, we see that the claim follows from the induction hypothesis; so we may assume that the top coefficient {a_d} is irrational.

In order to use the van der Corput inequality as stated above (i.e. in the formalism of inner product spaces) we will need a non-principal ultrafilter {p} (see e.g. this previous blog post for basic theory of ultrafilters); we leave it as an exercise to the reader to figure out how to present the argument below without the use of ultrafilters (or similar devices, such as Banach limits). The ultrafilter {p} defines an inner product {\langle, \rangle_p} on bounded complex sequences {z = (z_1,z_2,z_3,\dots)} by setting

\displaystyle  \langle z, w \rangle_p := \hbox{st} \lim_{N \rightarrow p} \frac{1}{N} \sum_{n=1}^N z_n \overline{w_n}.

Strictly speaking, this inner product is only positive semi-definite rather than positive definite, but one can quotient out by the null vectors to obtain a positive-definite inner product. To establish the claim, it will suffice to show that

\displaystyle  \langle 1, e(P) \rangle_p = 0

for every non-principal ultrafilter {p}.

Note that the space of bounded sequences (modulo null vectors) admits a shift {T}, defined by

\displaystyle  T (z_1,z_2,\dots) := (z_2,z_3,\dots).

This shift becomes unitary once we quotient out by null vectors, and the constant sequence {1} is clearly a unit vector that is invariant with respect to the shift. So by the van der Corput inequality, we have

\displaystyle  |\langle 1, e(P) \rangle_p|^2 \leq \frac{1}{n^2} \sum_{1 \leq i,j \leq n} |\langle T^i e(P), T^j e(P) \rangle_p|

for any {n \geq 1}. But we may rewrite {\langle T^i e(P), T^j e(P) \rangle_p = \langle 1, e(T^j P - T^i P) \rangle_p}. Then observe that if {i \neq j}, {T^j P - T^i P} is a polynomial of degree {d-1} whose degree {d-1} coefficient is irrational, so by the induction hypothesis we have {\langle T^i e(P), T^j e(P) \rangle_p = 0} for {i \neq j}. For {i=j} we of course have {\langle T^i e(P), T^j e(P) \rangle_p = 1}, and so

\displaystyle  |\langle 1, e(P) \rangle_p|^2 \leq \frac{1}{n^2} \times n

for any {n}. Letting {n \rightarrow \infty}, we obtain the claim. \Box
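To see Proposition 2 in action, one can compute the averages {\frac{1}{N} \sum_{n=1}^N e(P(n))} numerically for, say, {P(n) = \sqrt{2} n^2}. Here is a crude Python sketch; floating-point reduction modulo {1} loses accuracy for very large {N}, so this is illustrative only.

    import cmath, math

    def weyl_average(alpha, N):
        # (1/N) * sum_{n=1}^N e(alpha * n^2), where e(x) = exp(2 pi i x)
        return sum(cmath.exp(2j * math.pi * ((alpha * n * n) % 1.0))
                   for n in range(1, N + 1)) / N

    alpha = math.sqrt(2.0)   # irrational top coefficient
    for N in (10**3, 10**4, 10**5):
        print(N, abs(weyl_average(alpha, N)))   # the magnitudes decay towards 0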

This is the final continuation of the online reading seminar of Zhang’s paper for the polymath8 project. (There are two other continuations: this previous post, which deals with the combinatorial aspects of the second part of Zhang’s paper, and this previous post, which covers the Type I and Type II sums.) The main purpose of this post is to present (and hopefully, to improve upon) the treatment of the final and most innovative of the key estimates in Zhang’s paper, namely the Type III estimate.

The main estimate was already stated as Theorem 17 in the previous post, but we quickly recall the relevant definitions here. As in other posts, we always take {x} to be a parameter going off to infinity, with the usual asymptotic notation {O(), o(), \ll} associated to this parameter.

Definition 1 (Coefficient sequences) A coefficient sequence is a finitely supported sequence {\alpha: {\bf N} \rightarrow {\bf R}} that obeys the bounds

\displaystyle  |\alpha(n)| \ll \tau^{O(1)}(n) \log^{O(1)}(x) \ \ \ \ \ (1)

for all {n}, where {\tau} is the divisor function.

  • (i) If {\alpha} is a coefficient sequence and {a\ (q) = a \hbox{ mod } q} is a primitive residue class, the (signed) discrepancy {\Delta(\alpha; a\ (q))} of {\alpha} in this residue class is defined to be the quantity (see the numerical sketch after this definition)

    \displaystyle  \Delta(\alpha; a \ (q)) := \sum_{n: n = a\ (q)} \alpha(n) - \frac{1}{\phi(q)} \sum_{n: (n,q)=1} \alpha(n). \ \ \ \ \ (2)

  • (ii) A coefficient sequence {\alpha} is said to be at scale {N} for some {N \geq 1} if it is supported on an interval of the form {[(1-O(\log^{-A_0} x)) N, (1+O(\log^{-A_0} x)) N]}.
  • (iii) A coefficient sequence {\alpha} at scale {N} is said to be smooth if it takes the form {\alpha(n) = \psi(n/N)} for some smooth function {\psi: {\bf R} \rightarrow {\bf C}} supported on {[1-O(\log^{-A_0} x), 1+O(\log^{-A_0} x)]} obeying the derivative bounds

    \displaystyle  \psi^{(j)}(t) = O( \log^{j A_0} x ) \ \ \ \ \ (3)

    for all fixed {j \geq 0} (note that the implied constant in the {O()} notation may depend on {j}).
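To make the discrepancy (2) concrete, here is a small Python sketch that evaluates {\Delta(\alpha; a\ (q))}; the coefficient sequence, residue class and modulus below are arbitrary toy data, not the objects arising in Zhang’s argument.

    from math import gcd

    def phi(q):
        # Euler totient, by brute force (fine for small moduli)
        return sum(1 for a in range(1, q + 1) if gcd(a, q) == 1)

    def discrepancy(alpha, a, q):
        # Delta(alpha; a (q)) as in (2); alpha is a dict n -> alpha(n)
        in_class = sum(v for n, v in alpha.items() if n % q == a % q)
        coprime = sum(v for n, v in alpha.items() if gcd(n, q) == 1)
        return in_class - coprime / phi(q)

    alpha = {n: 1.0 for n in range(900, 1100)}   # toy sequence at scale N ~ 1000
    print(discrepancy(alpha, 3, 7))              # small compared to the total mass ~ 200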

For any {I \subset {\bf R}}, let {{\mathcal S}_I} denote the square-free numbers whose prime factors lie in {I}. The main result of this post is then the following result of Zhang:

Theorem 2 (Type III estimate) Let {\varpi, \delta > 0} be fixed quantities, and let {M, N_1, N_2, N_3 \gg 1} be quantities such that

\displaystyle  x \ll M N_1 N_2 N_3 \ll x

and

\displaystyle  N_1 \gg N_2, N_3

and

\displaystyle  N_1^4 N_2^4 N_3^5 \gg x^{4+16\varpi+\delta+c}

for some fixed {c>0}. Let {\alpha, \psi_1, \psi_2, \psi_3} be coefficient sequences at scale {M,N_1,N_2,N_3} respectively with {\psi_1,\psi_2,\psi_3} smooth. Then for any {I \subset [1,x^\delta]} we have

\displaystyle  \sum_{q \in {\mathcal S}_I: q< x^{1/2+2\varpi}} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\alpha \ast \psi_1 \ast \psi_2 \ast \psi_3; a)| \ll x \log^{-A} x

for any fixed {A}.

In fact we have the stronger “pointwise” estimate

\displaystyle  |\Delta(\alpha \ast \psi_1 \ast \psi_2 \ast \psi_3; a)| \ll x^{-\epsilon} \frac{x}{q} \ \ \ \ \ (4)

for all {q \in {\mathcal S}_I} with {q < x^{1/2+2\varpi}} and all {a \in ({\bf Z}/q{\bf Z})^\times}, and some fixed {\epsilon>0}.

(This is very slightly stronger than previously claimed, in that the condition {N_2 \gg N_3} has been dropped.)

It turns out that Zhang does not exploit any averaging of the {\alpha} factor, and matters reduce to the following:

Theorem 3 (Type III estimate without {\alpha}) Let {\delta > 0} be fixed, and let {1 \ll N_1, N_2, N_3, d \ll x^{O(1)}} be quantities such that

\displaystyle  N_1 \gg N_2, N_3

and

\displaystyle d \in {\mathcal S}_{[1,x^\delta]}

and

\displaystyle  N_1^4 N_2^4 N_3^5 \gg d^8 x^{\delta+c}

for some fixed {c>0}. Let {\psi_1,\psi_2,\psi_3} be smooth coefficient sequences at scales {N_1,N_2,N_3} respectively. Then we have

\displaystyle  |\Delta(\psi_1 \ast \psi_2 \ast \psi_3; a)| \ll x^{-\epsilon} \frac{N_1 N_2 N_3}{d}

for all {a \in ({\bf Z}/d{\bf Z})^\times} and some fixed {\epsilon>0}.

Let us quickly see how Theorem 3 implies Theorem 2. To show (4), it suffices to establish the bound

\displaystyle  \sum_{n = a\ (q)} \alpha \ast \psi_1 \ast \psi_2 \ast \psi_3(n) = X + O( x^{-\epsilon} \frac{x}{q} )

for all {a \in ({\bf Z}/q{\bf Z})^\times}, where {X} denotes a quantity that is independent of {a} (but can depend on other quantities such as {\alpha,\psi_1,\psi_2,\psi_3,q}). The left-hand side can be rewritten as

\displaystyle  \sum_{b \in ({\bf Z}/q{\bf Z})^\times} \sum_{m = b\ (q)} \alpha(m) \sum_{n = a/b\ (q)} \psi_1 \ast \psi_2 \ast \psi_3(n).

From Theorem 3 we have

\displaystyle  \sum_{n = a/b\ (q)} \psi_1 \ast \psi_2 \ast \psi_3(n) = Y + O( x^{-\epsilon} \frac{N_1 N_2 N_3}{q} )

where the quantity {Y} does not depend on {a} or {b}. Inserting this asymptotic and using crude bounds on {\alpha} (see Lemma 8 of this previous post) we conclude (4) as required (after modifying {\epsilon} slightly).

It remains to establish Theorem 3. This is done by a set of tools similar to that used to control the Type I and Type II sums:

  • (i) completion of sums;
  • (ii) the Weil conjectures and bounds on Ramanujan sums;
  • (iii) factorisation of smooth moduli {q \in {\mathcal S}_I};
  • (iv) the Cauchy-Schwarz and triangle inequalities (Weyl differencing).

The specifics are slightly different though. For the Type I and Type II sums, it was the classical Weil bound on Kloosterman sums that was the key source of power saving; Ramanujan sums only played a minor role, controlling a secondary error term. For the Type III sums, one needs a significantly deeper consequence of the Weil conjectures, namely the estimate of Bombieri and Birch on a three-dimensional variant of a Kloosterman sum. Furthermore, the Ramanujan sums – which are a rare example of sums that actually exhibit better than square root cancellation, thus going beyond even what the Weil conjectures can offer – make a crucial appearance, when combined with the factorisation of the smooth modulus {q} (this new argument is arguably the most original and interesting contribution of Zhang).


The following result is due independently to Furstenberg and to Sarkozy:

Theorem 1 (Furstenberg-Sarkozy theorem) Let {\delta > 0}, and suppose that {N} is sufficiently large depending on {\delta}. Then every subset {A} of {[N] := \{1,\ldots,N\}} of density {|A|/N} at least {\delta} contains a pair {n, n+r^2} for some natural numbers {n, r} with {r \neq 0}.

This theorem is of course similar in spirit to results such as Roth’s theorem or Szemerédi’s theorem, in which the pattern {n,n+r^2} is replaced by {n,n+r,n+2r} or {n,n+r,\ldots,n+(k-1)r} for some fixed {k} respectively. There are by now many proofs of this theorem (see this recent paper of Lyall for a survey), but most proofs involve some form of Fourier analysis (or spectral theory). This may be compared with the standard proof of Roth’s theorem, which combines some Fourier analysis with what is now known as the density increment argument.

A few years ago, Ben Green, Tamar Ziegler, and myself observed that it is possible to prove the Furstenberg-Sarkozy theorem by just using the Cauchy-Schwarz inequality (or van der Corput lemma) and the density increment argument, removing all invocations of Fourier analysis, and instead relying on Cauchy-Schwarz to linearise the quadratic shift {r^2}. As such, this theorem can be considered as even more elementary than Roth’s theorem (and its proof can be viewed as a toy model for the proof of Roth’s theorem). We ended up not doing too much with this observation, so we decided to share it here.

The first step is to use the density increment argument that goes back to Roth. For any {\delta > 0}, let {P(\delta)} denote the assertion that for {N} sufficiently large, all sets {A \subset [N]} of density at least {\delta} contain a pair {n,n+r^2} with {r} non-zero. Note that {P(\delta)} is vacuously true for {\delta > 1}. We will show that for any {0 < \delta_0 \leq 1}, one has the implication

\displaystyle  P(\delta_0 + c \delta_0^3) \implies P(\delta_0) \ \ \ \ \ (1)

for some absolute constant {c>0}. This implies that {P(\delta)} is true for any {\delta>0} (as can be seen by considering the infimum of all {\delta>0} for which {P(\delta)} holds), which gives Theorem 1.

It remains to establish the implication (1). Suppose for sake of contradiction that we can find {0 < \delta_0 \leq 1} for which {P(\delta_0+c\delta^3_0)} holds (for some sufficiently small absolute constant {c>0}), but {P(\delta_0)} fails. Thus, we can find arbitrarily large {N}, and subsets {A} of {[N]} of density at least {\delta_0}, such that {A} contains no patterns of the form {n,n+r^2} with {r} non-zero. In particular, we have

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h \in [N^{1/100}]} 1_A(n) 1_A(n+(r+h)^2) = 0.

(The exact ranges of {r} and {h} are not too important here, and could be replaced by various other small powers of {N} if desired.)

Let {\delta := |A|/N} be the density of {A}, so that {\delta_0 \leq \delta \leq 1}. Observe that

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h \in [N^{1/100}]} 1_A(n) \delta 1_{[N]}(n+(r+h)^2) = \delta^2 + O(N^{-1/3})

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h \in [N^{1/100}]} \delta 1_{[N]}(n) \delta 1_{[N]}(n+(r+h)^2) = \delta^2 + O(N^{-1/3})

and

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h \in [N^{1/100}]} \delta 1_{[N]}(n) 1_A(n+(r+h)^2) = \delta^2 + O( N^{-1/3} ).

If we thus set {f := 1_A - \delta 1_{[N]}}, then

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h \in [N^{1/100}]} f(n) f(n+(r+h)^2) = -\delta^2 + O( N^{-1/3} ).

In particular, for {N} large enough,

\displaystyle  \mathop{\bf E}_{n \in [N]} |f(n)| \mathop{\bf E}_{r \in [N^{1/3}]} |\mathop{\bf E}_{h \in [N^{1/100}]} f(n+(r+h)^2)| \gg \delta^2.

On the other hand, one easily sees that

\displaystyle  \mathop{\bf E}_{n \in [N]} |f(n)|^2 = O(\delta)

and hence by the Cauchy-Schwarz inequality

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} |\mathop{\bf E}_{h \in [N^{1/100}]} f(n+(r+h)^2)|^2 \gg \delta^3

which we can rearrange as

\displaystyle  |\mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h,h' \in [N^{1/100}]} \mathop{\bf E}_{n \in [N]} f(n+(r+h)^2) f(n+(r+h')^2)| \gg \delta^3.

Shifting {n} by {(r+h)^2} we obtain (again for {N} large enough)

\displaystyle  |\mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{h,h' \in [N^{1/100}]} \mathop{\bf E}_{n \in [N]} f(n) f(n+(h'-h)(2r+h'+h))| \gg \delta^3.

In particular, by the pigeonhole principle (and deleting the diagonal case {h=h'}, which we can do for {N} large enough) we can find distinct {h,h' \in [N^{1/100}]} such that

\displaystyle  |\mathop{\bf E}_{r \in [N^{1/3}]} \mathop{\bf E}_{n \in [N]} f(n) f(n+(h'-h)(2r+h'+h))| \gg \delta^3,

so in particular

\displaystyle  \mathop{\bf E}_{n \in [N]} |\mathop{\bf E}_{r \in [N^{1/3}]} f(n+(h'-h)(2r+h'+h))| \gg \delta^3.

If we set {d := 2(h'-h)} and shift {n} by {(h'-h) (h'+h)}, we can simplify this (again for {N} large enough) as

\displaystyle  \mathop{\bf E}_{n \in [N]} |\mathop{\bf E}_{r \in [N^{1/3}]} f(n+dr)| \gg \delta^3. \ \ \ \ \ (2)

On the other hand, since

\displaystyle  \mathop{\bf E}_{n \in [N]} f(n) = 0

we have

\displaystyle  \mathop{\bf E}_{n \in [N]} f(n+dr) = O( N^{-2/3+1/100})

for any {r \in [N^{1/3}]}, and thus

\displaystyle  \mathop{\bf E}_{n \in [N]} \mathop{\bf E}_{r \in [N^{1/3}]} f(n+dr) = O( N^{-2/3+1/100}).

Averaging this with (2) we conclude that

\displaystyle  \mathop{\bf E}_{n \in [N]} \max( \mathop{\bf E}_{r \in [N^{1/3}]} f(n+dr), 0 ) \gg \delta^3.

In particular, by the pigeonhole principle we can find {n \in [N]} such that

\displaystyle  \mathop{\bf E}_{r \in [N^{1/3}]} f(n+dr) \gg \delta^3,

or equivalently {A} has density at least {\delta+c'\delta^3} on the arithmetic progression {\{ n+dr: r \in [N^{1/3}]\}}, which has length {\lfloor N^{1/3}\rfloor} and spacing {d}, for some absolute constant {c'>0}. By partitioning this progression into subprogressions of spacing {d^2} and length {\lfloor N^{1/4}\rfloor} (plus an error set of size {O(N^{1/4})}), we see from the pigeonhole principle that we can find a progression {\{ n' + d^2 r': r' \in [N^{1/4}]\}} of length {\lfloor N^{1/4}\rfloor} and spacing {d^2} on which {A} has density at least {\delta + c\delta^3} (and hence at least {\delta_0+c\delta_0^3}) for some absolute constant {c>0}. If we then apply the induction hypothesis to the set

\displaystyle  A' := \{ r' \in [N^{1/4}]: n' + d^2 r' \in A \}

we conclude (for {N} large enough) that {A'} contains a pair {m, m+s^2} for some natural numbers {m,s} with {s} non-zero. This implies that {(n'+d^2 m), (n'+d^2 m) + (|d|s)^2} lie in {A}, a contradiction, establishing the implication (1).

A more careful analysis of the above argument reveals a more quantitative version of Theorem 1: for {N \geq 100} (say), any subset of {[N]} of density at least {C/(\log\log N)^{1/2}} for some sufficiently large absolute constant {C} contains a pair {n,n+r^2} with {r} non-zero. This is not the best bound known; a (difficult) result of Pintz, Steiger, and Szemerédi allows the density to be as low as {C / (\log N)^{\frac{1}{4} \log\log\log\log N}}. On the other hand, this already improves on the (simpler) Fourier-analytic argument of Green that works for densities at least {C/(\log\log N)^{1/11}} (although the original argument of Sarkozy, which is a little more intricate, works up to {C (\log\log N)^{2/3}/(\log N)^{1/3}}). In the other direction, a construction of Ruzsa gives a set of density {\frac{1}{65} N^{-0.267}} without any pairs {n,n+r^2}.
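One can watch this density decay experimentally: the following Python sketch greedily builds a subset of {[N]} avoiding the pattern {n,n+r^2}. The greedy set is certainly not optimal (in particular it is not Ruzsa’s construction), but it illustrates the decay.

    def greedy_no_square_differences(N):
        # greedily pick n in [1, N], skipping n whenever some n - r^2 was already picked
        squares = [r * r for r in range(1, int(N ** 0.5) + 1)]
        chosen = set()
        for n in range(1, N + 1):
            if all(n - s not in chosen for s in squares):
                chosen.add(n)
        return chosen

    for N in (10**3, 10**4, 10**5):
        A = greedy_no_square_differences(N)
        print(N, len(A) / N)   # the density slowly decays, as Theorem 1 demands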

Remark 1 A similar argument also applies with {n,n+r^2} replaced by {n,n+r^k} for fixed {k}, because this sort of pattern is preserved by affine dilations {r' \mapsto n'+d^k r'} into arithmetic progressions whose spacing {d^k} is a {k^{th}} power. By re-introducing Fourier analysis, one can also perform an argument of this type for {n,n+d,n+2d} where {d} is the sum of two squares; see the above-mentioned paper of Green for details. However there seems to be some technical difficulty in extending it to patterns of the form {n,n+P(r)} for polynomials {P} that consist of more than a single monomial (and with the normalisation {P(0)=0}, to avoid local obstructions), because one no longer has this preservation property.

I’ve just uploaded to the arXiv my paper “Mixing for progressions in non-abelian groups”, submitted to Forum of Mathematics, Sigma (which, along with sister publication Forum of Mathematics, Pi, has just opened up its online submission system). This paper is loosely related in subject matter to my two previous papers on polynomial expansion and on recurrence in quasirandom groups (with Vitaly Bergelson), although the methods here are rather different from those in those two papers. The starting motivation for this paper was a question posed in this foundational paper of Tim Gowers on quasirandom groups. In that paper, Gowers showed (among other things) that if {G} was a quasirandom group, patterns such as {(x,xg,xh, xgh)} were mixing in the sense that, for any four sets {A,B,C,D \subset G}, the number of such quadruples {(x,xg,xh, xgh)} in {A \times B \times C \times D} was equal to {(\mu(A) \mu(B) \mu(C) \mu(D) + o(1)) |G|^3}, where {\mu(A) := |A| / |G|}, and {o(1)} denotes a quantity that goes to zero as the quasirandomness of the group goes to infinity. In my recent paper with Vitaly, we also considered mixing properties of some other patterns, namely {(x,xg,gx)} and {(g,x,xg,gx)}. This paper is concerned instead with the pattern {(x,xg,xg^2)}, that is to say a geometric progression of length three. As observed by Gowers, by applying (a suitably quantitative version of) Roth’s theorem in (cosets of) a cyclic group, one can obtain a recurrence theorem for this pattern without much effort: if {G} is an arbitrary finite group, and {A} is a subset of {G} with {\mu(A) \geq \delta}, then there are at least {c(\delta) |G|^2} pairs {(x,g) \in G^2} such that {x, xg, xg^2 \in A}, where {c(\delta)>0} is a quantity depending only on {\delta}. However, this argument does not settle the question of whether there is a stronger mixing property, in that the number of pairs {(x,g) \in G^2} such that {(x,xg,xg^2) \in A \times B \times C} should be {(\mu(A)\mu(B)\mu(C)+o(1)) |G|^2} for any {A,B,C \subset G}. Informally, this would assert that for {x, g} chosen uniformly at random from {G}, the triplet {(x, xg, xg^2)} should resemble a uniformly selected element of {G^3} in some weak sense.

For non-quasirandom groups, such mixing properties can certainly fail. For instance, if {G} is the cyclic group {G = ({\bf Z}/N{\bf Z},+)} (which is abelian and thus highly non-quasirandom) with the additive group operation, and {A = \{1,\ldots,\lfloor \delta N\rfloor\}} for some small but fixed {\delta > 0}, then {\mu(A) = \delta + o(1)} in the limit {N \rightarrow \infty}, but the number of pairs {(x,g) \in G^2} with {x, x+g, x+2g \in A} is {(\delta^2/2 + o(1)) |G|^2} rather than {(\delta^3+o(1)) |G|^2}. The problem here is that the identity {(x+2g) = 2(x+g) - x} ensures that if {x} and {x+g} both lie in {A}, then {x+2g} has a highly elevated likelihood of also falling in {A}. One can view {A} as the preimage of a small ball under the one-dimensional representation {\rho: G \rightarrow U(1)} defined by {\rho(n) := e^{2\pi i n/N}}; similar obstructions to mixing can also be constructed from other low-dimensional representations.
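This count is easy to reproduce experimentally; here is a quick Python sketch with the arbitrary test parameters {N = 600} and {\delta = 0.2}.

    N = 600
    delta = 0.2
    A = set(range(1, int(delta * N) + 1))   # an interval of density delta in Z/NZ

    count = sum(1 for x in range(N) for g in range(N)
                if x in A and (x + g) % N in A and (x + 2 * g) % N in A)

    print(count / N**2)   # 0.02 = delta^2/2, not the mixing prediction delta^3 = 0.008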

However, by definition, quasirandom groups do not have low-dimensional representations, and Gowers asked whether mixing for {(x,xg,xg^2)} could hold for quasirandom groups. I do not know if this is the case for arbitrary quasirandom groups, but I was able to settle the question for a specific class of quasirandom groups, namely the special linear groups {G := SL_d(F)} over a finite field {F} in the regime where the dimension {d} is bounded (but is at least two) and {F} is large. Indeed, for such groups I can obtain a count of {(\mu(A) \mu(B) \mu(C) + O( |F|^{-\min(d-1,2)/8} )) |G|^2} for the number of pairs {(x,g) \in G^2} with {(x, xg, xg^2) \in A \times B \times C}. In fact, I have the somewhat stronger statement that there are {(\mu(A) \mu(B) \mu(C) \mu(D) + O( |F|^{-\min(d-1,2)/8} )) |G|^2} pairs {(x,g) \in G^2} with {(x,xg,xg^2,g) \in A \times B \times C \times D} for any {A,B,C,D \subset G}.

I was also able to obtain a partial result for the length four progression {(x,xg,xg^2, xg^3)} in the simpler two-dimensional case {G = SL_2(F)}, but I had to make the unusual restriction that the group element {g \in G} was hyperbolic in the sense that it was diagonalisable over the finite field {F} (as opposed to diagonalisable over the algebraic closure {\overline{F}} of that field); this amounts to the discriminant of the matrix being a quadratic residue, and this holds for approximately half of the elements of {G}. The result is then that for any {A,B,C,D \subset G}, one has {(\frac{1}{2} \mu(A) \mu(B) \mu(C) \mu(D) + o(1)) |G|^2} pairs {(x,g) \in G^2} with {g} hyperbolic and {(x,xg,xg^2,xg^3) \in A \times B \times C \times D}. (Again, I actually show a slightly stronger statement in which {g} is restricted to an arbitrary subset {E} of hyperbolic elements.)

For the length three argument, the main tools used are the Cauchy-Schwarz inequality, the quasirandomness of {G}, and some algebraic geometry to ensure that a certain family of probability measures on {G} that are defined algebraically are approximately uniformly distributed. The length four argument is significantly more difficult and relies on a rather ad hoc argument involving, among other things, expander properties related to the work of Bourgain and Gamburd, and also a “twisted” version of an argument of Gowers that is used (among other things) to establish an inverse theorem for the {U^3} norm.

I give some details of these arguments below the fold.


Title: Use basic examples to calibrate exponents

Motivation: In the more quantitative areas of mathematics, such as analysis and combinatorics, one has to frequently keep track of a large number of exponents in one’s identities, inequalities, and estimates.  For instance, if one is studying a set of N elements, then many expressions that one is faced with will often involve some power N^p of N; if one is instead studying a function f on a measure space X, then perhaps it is an L^p norm \|f\|_{L^p(X)} which will appear instead.  The exponent p involved will typically evolve slowly over the course of the argument, as various algebraic or analytic manipulations are applied.  In some cases, the exact value of this exponent is immaterial, but at other times it is crucial to have the correct value of p at hand.   One can (and should) of course carefully go through one’s arguments line by line to work out the exponents correctly, but it is all too easy to make a sign error or other mis-step at one of the lines, causing all the exponents on subsequent lines to be incorrect.  However, one can guard against this (and avoid some tedious line-by-line exponent checking) by continually calibrating these exponents at key junctures of the arguments by using basic examples of the object of study (sets, functions, graphs, etc.) as test cases.  This is a simple trick, but it lets one avoid many unforced errors with exponents, and also lets one compute more rapidly.

Quick description: When trying to quickly work out what an exponent p in an estimate, identity, or inequality should be without deriving that statement line-by-line, test that statement with a simple example which has non-trivial behaviour with respect to that exponent p, but trivial behaviour with respect to as many other components of that statement as one is able to manage.   The “non-trivial” behaviour should be parametrised by some very large or very small parameter.  By matching the dependence on this parameter on both sides of the estimate, identity, or inequality, one should recover p (or at least a good prediction as to what p should be).

General discussion: The test examples should be as basic as possible; ideally they should have trivial behaviour in all aspects except for one feature that relates to the exponent p that one is trying to calibrate, thus being only “barely” non-trivial.   When the object of study is a function, then (appropriately rescaled, or otherwise modified) bump functions are very typical test objects, as are Dirac masses, constant functions, Gaussians, or other functions that are simple and easy to compute with.  In additive combinatorics, when the object of study is a subset of a group, then subgroups, arithmetic progressions, or random sets are typical test objects.  In graph theory, typical examples of test objects include complete graphs, complete bipartite graphs, and random graphs. And so forth.

This trick is closely related to that of using dimensional analysis to recover exponents; indeed, one can view dimensional analysis as the special case of exponent calibration when using test objects which are non-trivial in one dimensional aspect (e.g. they exist at a single very large or very small length scale) but are otherwise of a trivial or “featureless” nature.   But the calibration trick is more general, as it can involve parameters (such as probabilities, angles, or eccentricities) which are not commonly associated with the physical concept of a dimension.  And personally, I find example-based calibration to be a much more satisfying (and convincing) explanation of an exponent than a calibration arising from formal dimensional analysis.

When one is trying to calibrate an inequality or estimate, one should try to pick a basic example which one expects to saturate that inequality or estimate, i.e. an example for which the inequality is close to being an equality.  Otherwise, one would only expect to obtain some partial information on the desired exponent p (e.g. a lower bound or an upper bound only).  Knowing the examples that saturate an estimate that one is trying to prove is also useful for several other reasons – for instance, it strongly suggests that any technique which is not efficient when applied to the saturating example, is unlikely to be strong enough to prove the estimate in general, thus eliminating fruitless approaches to a problem and (hopefully) refocusing one’s attention on those strategies which actually have a chance of working.
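As a toy illustration in code form, here is a Python sketch that calibrates the exponent {p} in the guessed estimate {(\sum_{i=1}^N a_i)^2 \leq C N^p \sum_{i=1}^N a_i^2} by testing it against the constant sequence {a_i = 1}, which saturates the Cauchy-Schwarz inequality.

    import math

    def ratio(N):
        # (sum a_i)^2 / (sum a_i^2) for the saturating example a_i = 1
        a = [1.0] * N
        return sum(a) ** 2 / sum(x * x for x in a)

    # on the saturating example the ratio grows like N^p,
    # so the log-log slope between two large values of N recovers p
    N1, N2 = 10**3, 10**6
    p = math.log(ratio(N2) / ratio(N1)) / math.log(N2 / N1)
    print(p)   # 1.0: Cauchy-Schwarz costs exactly one power of N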

Calibration is best used for the type of quick-and-dirty calculations one uses when trying to rapidly map out an argument that one has roughly worked out already, but without precise details; in particular, I find it particularly useful when writing up a rapid prototype.  When the time comes to write out the paper in full detail, then of course one should instead carefully work things out line by line, but if all goes well, the exponents obtained in that process should match up with the preliminary guesses for those exponents obtained by calibration, which adds confidence that no exponent errors have been committed.

Prerequisites: Undergraduate analysis and combinatorics.


This week I was in Columbus, Ohio, attending a conference on equidistribution on manifolds. I talked about my recent paper with Ben Green on the quantitative behaviour of polynomial sequences in nilmanifolds, which I have blogged about previously. During my talk (and inspired by the immediately preceding talk of Vitaly Bergelson), I stated explicitly for the first time a generalisation of the van der Corput trick which morally underlies our paper, though it is somewhat buried there as we specialised it to our application at hand (and also had to deal with various quantitative issues that made the presentation more complicated). After the talk, several people asked me for a more precise statement of this trick, so I am presenting it here, and as an application reproving an old theorem of Leon Green that gives a necessary and sufficient condition as to whether a linear sequence (g^n x)_{n=1}^\infty on a nilmanifold G/\Gamma is equidistributed, which generalises the famous theorem of Weyl on equidistribution of polynomials.

UPDATE, Feb 2013: It has been pointed out to me by Pavel Zorin that this argument does not fully recover the theorem of Leon Green; to cover all cases, one needs the more complicated van der Corput argument in our paper.


In the previous lecture, we studied the recurrence properties of compact systems, which are systems in which all measurable functions exhibit almost periodicity – they almost return completely to themselves after repeated shifting. Now, we consider the opposite extreme of mixing systems – those in which all measurable functions (of mean zero) exhibit mixing – they become orthogonal to themselves after repeated shifting. (Actually, there are two different types of mixing, strong mixing and weak mixing, depending on whether the orthogonality occurs individually or on the average; it is the latter concept which is of more importance to the task of establishing the Furstenberg recurrence theorem.)

We shall see that for weakly mixing systems, averages such as \frac{1}{N} \sum_{n=0}^{N-1} T^n f \ldots T^{(k-1)n} f can be computed very explicitly (in fact, this average converges to the constant (\int_X f\ d\mu)^{k-1}). More generally, we shall see that weakly mixing components of a system tend to average themselves out and thus become irrelevant when studying many types of ergodic averages. Our main tool here will be the humble Cauchy-Schwarz inequality, and in particular a certain consequence of it, known as the van der Corput lemma.

As one application of this theory, we will be able to establish Roth’s theorem (the k=3 case of Szemerédi’s theorem).

