A large portion of analytic number theory is concerned with the distribution of number-theoretic sets such as the primes, or quadratic residues in a certain modulus. At a local level (e.g. on a short interval $[x, x+o(x)]$), the behaviour of these sets may be quite irregular. However, in many cases one can understand the *global* behaviour of such sets on very large intervals (e.g. $[1,x]$) with reasonable accuracy (particularly if one assumes powerful additional conjectures, such as the Riemann hypothesis and its generalisations). For instance, in the case of the primes, we have the prime number theorem, which asserts that the number of primes in a large interval $[1,x]$ is asymptotically equal to $x/\log x$; in the case of quadratic residues modulo a prime $p$, it is clear that there are exactly $(p-1)/2$ such residues in $[1,p]$. With elementary arguments, one can also count statistics such as the number of pairs of consecutive quadratic residues; and with the aid of deeper tools such as the Weil sum estimates, one can count more complex patterns in these residues also (e.g. $k$-point correlations).

One is often interested in converting this sort of “global” information on long intervals into “local” information on short intervals. If one is interested in the behaviour on a *generic* or *average* short interval, then the question is still essentially a global one, basically because one can view a long interval as an average of a long sequence of short intervals. (This does not mean that the problem is automatically easy, because not every global statistic about, say, the primes is understood. For instance, we do not know how to rigorously establish the conjectured asymptotic for the number of twin primes in a long interval $[1,x]$, and so we do not fully understand the local distribution of the primes in a typical short interval.)

However, suppose that instead of understanding the *average-case* behaviour of short intervals, one wants to control the *worst-case* behaviour of such intervals (i.e. to establish bounds that hold for *all* short intervals, rather than *most* short intervals). Then it becomes substantially harder to convert global information to local information. In many cases one encounters a “square root barrier”, in which global information at scale $x$ (e.g. statistics on the interval $[1,x]$) cannot be used to say anything non-trivial about a fixed (and possibly worst-case) short interval at scales $\sqrt{x}$ or below. (Here we ignore factors of $\log x$ for simplicity.) The basic reason for this is that even randomly distributed sets in $[1,x]$ (which are basically the most uniform type of global distribution one could hope for) exhibit random fluctuations of size $\sqrt{x}$ or so in their global statistics (as can be seen for instance from the central limit theorem). Because of this, one could take a random (or pseudorandom) subset of $[1,x]$ and delete all the elements in a short interval of length $\sqrt{x}$, without anything suspicious showing up on the global statistics level; the edited set still has essentially the same global statistics as the original set. On the other hand, the worst-case behaviour of this set on a short interval has been drastically altered.

One stark example of this arises when trying to control the largest gap between consecutive prime numbers in a large interval $[1,x]$. There are convincing heuristics that suggest that this largest gap is of size $O(\log^2 x)$ (Cramér’s conjecture). But even assuming the Riemann hypothesis, the best upper bound on this gap is only of size $O(\sqrt{x} \log x)$, basically because of this square root barrier. This particular instance of the square root barrier is a significant obstruction to the current polymath project “Finding primes”.

On the other hand, in some cases one can use additional tricks to get past the square root barrier. The key point is that many number-theoretic sequences have special structure that distinguishes them from being exactly like random sets. For instance, quadratic residues have the basic but fundamental property that the product of two quadratic residues is again a quadratic residue. One way to use this sort of structure is to amplify bad behaviour in a single short interval into bad behaviour across many short intervals. Because of this amplification, one can sometimes get new worst-case bounds by tapping the average-case bounds.

In this post I would like to indicate a classical example of this type of amplification trick, namely Burgess’s bound on short character sums. To narrow the discussion, I would like to focus primarily on the following classical problem:

Problem 1. What are the best bounds one can place on the first quadratic non-residue $n_p$ in the interval $[1, p-1]$ for a large prime $p$?

(The first quadratic residue is, of course, $1$; the more interesting problem is the first quadratic non-residue.)

Probabilistic heuristics (presuming that each non-square integer has a 50-50 chance of being a quadratic residue) suggest that $n_p$ should have size $O(\log p)$, and indeed Vinogradov conjectured that $n_p \ll_\varepsilon p^\varepsilon$ for any $\varepsilon > 0$. Using the Pólya-Vinogradov inequality, one can get the bound $n_p \ll \sqrt{p} \log p$ (and can improve it to $n_p \ll \sqrt{p}$ using smoothed sums); combining this with a sieve theory argument (exploiting the multiplicative nature of quadratic residues) one can boost this to $n_p \ll p^{\frac{1}{2\sqrt{e}}+o(1)}$. Inserting Burgess’s amplification trick one can boost this to $n_p \ll_\varepsilon p^{\frac{1}{4\sqrt{e}}+\varepsilon}$ for any $\varepsilon > 0$. Apart from refinements to the $\varepsilon$ factor, this bound has stood for five decades as the “world record” for this problem, which is a testament to the difficulty in breaching the square root barrier.
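As a quick sanity check on these heuristics, one can compute $n_p$ directly for a few primes. The following is a minimal Python sketch (the helper name `least_nonresidue` is ours), evaluating the Legendre symbol via Euler's criterion $n^{(p-1)/2} \hbox{ mod } p$; it also checks the classical elementary bound $n_p < \sqrt{p}+1$:

```python
# Empirical look at the least quadratic non-residue n_p (illustrative sketch;
# `least_nonresidue` is our own helper name, not from the post).

def least_nonresidue(p):
    """Least n >= 2 with Legendre symbol (n|p) = -1, via Euler's criterion."""
    n = 2
    while pow(n, (p - 1) // 2, p) == 1:
        n += 1
    return n

for p in [10007, 100003, 1000003]:
    n_p = least_nonresidue(p)
    # classical elementary bound: n_p < sqrt(p) + 1; in practice n_p is tiny
    assert n_p < p ** 0.5 + 1
    print(p, n_p)
```

In practice $n_p$ is almost always a very small prime, in line with the probabilistic heuristic; the difficulty is entirely in the worst case.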

Note: in order not to obscure the presentation with technical details, I will be using asymptotic notation in a somewhat informal manner.

** — 1. Character sums — **

To approach the problem, we begin by fixing the large prime $p$ and introducing the Legendre symbol $\chi(n) := \left(\frac{n}{p}\right)$, defined to equal $0$ when $n$ is divisible by $p$, $+1$ when $n$ is an invertible quadratic residue modulo $p$, and $-1$ when $n$ is an invertible quadratic non-residue modulo $p$. Thus, for instance, $\chi(m^2) = +1$ for all $m$ coprime to $p$. One of the main reasons one wants to work with the function $\chi$ is that it enjoys two easily verified properties:

- $\chi$ is periodic with period $p$.
- One has the total multiplicativity property $\chi(nm) = \chi(n) \chi(m)$ for all integers $n, m$.

In the jargon of number theory, $\chi$ is a Dirichlet character with conductor $p$. Another important property of this character is of course the law of quadratic reciprocity, but this law is more important for the *average-case* behaviour of $n_p$ in $p$, whereas we are concerned here with the *worst-case* behaviour in $p$, and so we will not actually use this law here.
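Both properties are easy to check numerically; the sketch below (the helper `chi` is our own, computing the symbol via Euler's criterion) verifies periodicity and total multiplicativity for a small test prime:

```python
p = 101  # a small test prime

def chi(n):
    """Legendre symbol (n|p): returns 0, +1, or -1, via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

# periodicity with period p
assert all(chi(n + p) == chi(n) for n in range(300))

# total multiplicativity chi(nm) = chi(n) chi(m)
assert all(chi(n * m) == chi(n) * chi(m) for n in range(40) for m in range(40))
```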

An obvious way to control $n_p$ is via the character sum

$\displaystyle \sum_{1 \le n \le x} \chi(n). \ \ \ \ \ (1)$

From the triangle inequality, we see that this sum has magnitude at most $x$. If we can then obtain a non-trivial bound of the form

$\displaystyle \sum_{1 \le n \le x} \chi(n) = o(x) \ \ \ \ \ (2)$

for some $x$, this forces the existence of a quadratic non-residue less than or equal to $x$, thus $n_p \le x$. So one approach to the problem is to bound the character sum (1).

As there are just as many residues as non-residues, the sum (1) is periodic with period $p$, and we obtain a trivial bound of $O(p)$ for the magnitude of the sum. One can achieve a non-trivial bound by Fourier analysis. One can expand

$\displaystyle \sum_{1 \le n \le x} \chi(n) = \sum_{\xi \in {\bf Z}/p{\bf Z}} \hat \chi(\xi) \sum_{1 \le n \le x} e^{2\pi i n \xi / p}$

where $\hat \chi(\xi)$ are the Fourier coefficients of $\chi$:

$\displaystyle \hat \chi(\xi) := \frac{1}{p} \sum_{n \in {\bf Z}/p{\bf Z}} \chi(n) e^{-2\pi i n \xi / p}.$

As there are just as many quadratic residues as non-residues, $\hat \chi(0) = 0$, so we may drop the $\xi = 0$ term. From summing the geometric series we see that

$\displaystyle \Big| \sum_{1 \le n \le x} e^{2\pi i n \xi / p} \Big| \ll \frac{1}{\| \xi / p \|}$

where $\|\theta\|$ is the distance from $\theta$ to the nearest integer ($0$ or $1$); inserting these bounds into (1) and summing what is essentially a harmonic series in $\xi$ we obtain

$\displaystyle \Big| \sum_{1 \le n \le x} \chi(n) \Big| \ll \Big( \sup_{\xi \neq 0} |\hat \chi(\xi)| \Big) p \log p. \ \ \ \ \ (3)$
Now, how big is $\hat \chi(\xi)$? Taking absolute values, we get a bound of $O(1)$, but this gives us something worse than the trivial bound. To do better, we use the Plancherel identity

$\displaystyle \sum_{\xi \in {\bf Z}/p{\bf Z}} |\hat \chi(\xi)|^2 = \frac{1}{p} \sum_{n \in {\bf Z}/p{\bf Z}} |\chi(n)|^2$

which tells us that

$\displaystyle \sum_{\xi \in {\bf Z}/p{\bf Z}} |\hat \chi(\xi)|^2 = \frac{p-1}{p} \le 1.$

This tells us that $\hat \chi(\xi)$ is of size $O(1/\sqrt{p})$ *on the average*, but does not immediately tell us anything new about the *worst-case* behaviour of $\hat \chi(\xi)$, which is what we need here. But now we use the multiplicative structure of $\chi$ to relate average-case and worst-case behaviour. Note that if $m$ is coprime to $p$, then the function $n \mapsto \chi(mn)$ is a scalar multiple of $n \mapsto \chi(n)$ by a quantity $\chi(m)$ of magnitude $1$; taking Fourier transforms, this implies that $\hat \chi(\xi/m)$ and $\hat \chi(\xi)$ also differ by this factor. In particular, $|\hat \chi(\xi/m)| = |\hat \chi(\xi)|$. As $m$ was arbitrary, we thus see that $|\hat \chi(\xi)|$ is constant for all $\xi$ coprime to $p$; in other words, the worst case is the *same* as the average case. Combining this with the Plancherel bound one obtains $\sup_{\xi \neq 0} |\hat \chi(\xi)| \le \frac{1}{\sqrt{p}}$, leading to the *Pólya-Vinogradov inequality*

$\displaystyle \Big| \sum_{1 \le n \le x} \chi(n) \Big| \ll \sqrt{p} \log p.$
(In fact, a more careful computation reveals the slightly sharper bound $|\sum_{1 \le n \le x} \chi(n)| \le \sqrt{p} \log p$; this is non-trivial for $x \gg \sqrt{p} \log p$.)
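The “worst case equals average case” phenomenon for $\hat\chi$ can be observed numerically for a small prime: $\hat\chi(0)$ vanishes, and $|\hat\chi(\xi)|$ equals exactly $1/\sqrt{p}$ for every non-zero frequency. An illustrative Python sketch (not part of the proof; `chi` and `chi_hat` are our own helpers):

```python
import cmath
import math

p = 103  # a small test prime

def chi(n):
    """Legendre symbol (n|p) via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

def chi_hat(xi):
    """Fourier coefficient (1/p) * sum_n chi(n) e(-n xi / p)."""
    return sum(chi(n) * cmath.exp(-2j * math.pi * n * xi / p)
               for n in range(p)) / p

assert abs(chi_hat(0)) < 1e-9  # mean zero: as many residues as non-residues
# |chi_hat(xi)| = 1/sqrt(p) for *every* non-zero xi, not just on average
assert all(abs(abs(chi_hat(xi)) - 1 / math.sqrt(p)) < 1e-9
           for xi in range(1, p))
```

This is of course just the classical evaluation of the magnitude of the Gauss sum in disguise.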

Remark 1. Up to logarithmic factors, this is consistent with what one would expect if $\chi$ fluctuated like a random sign pattern (at least for $x$ comparable to $p$; for smaller values of $x$, one expects instead a bound of the form $O(\sqrt{x})$, up to logarithmic factors). It is conjectured that the $\log p$ factor can be replaced with a $\log \log p$ factor, which would be consistent with the random fluctuation model and is best possible; this is known assuming the generalised Riemann hypothesis, but unconditionally the Pólya-Vinogradov inequality is still the best known. (See however this paper of Granville and Soundararajan for an improvement for non-quadratic characters $\chi$.)

A direct application of the Pólya-Vinogradov inequality gives the bound $n_p \ll \sqrt{p} \log p$. One can get rid of the logarithmic factor (which comes from the harmonic series arising from (3)) by replacing the sharp cutoff $1_{1 \le n \le x}$ by a smoother sum, which has a better behaved Fourier transform. But one can do better still by exploiting the multiplicativity of $\chi$ again, by the following trick of Vinogradov. Observe that not only does one have $\chi(n) = +1$ for all $n < n_p$, but also $\chi(n) = +1$ for any $n$ which is $(n_p-1)$-smooth, i.e. is the product of integers less than $n_p$. So even if $n_p$ is significantly less than $x$, one can show that the sum (1) is large if the majority of integers less than $x$ are $(n_p-1)$-smooth.

Since every integer less than $x$ is either $(n_p-1)$-smooth (in which case $\chi(n) = +1$), or divisible by a prime $q$ between $n_p$ and $x$ (in which case $\chi(n)$ is at least $-1$), we obtain the lower bound

$\displaystyle \sum_{1 \le n \le x} \chi(n) \ge \sum_{1 \le n \le x} 1 - 2 \sum_{n_p \le q \le x} \sum_{1 \le n \le x: q | n} 1$

where $q$ ranges over primes. Clearly, $\sum_{1 \le n \le x} 1 = x + O(1)$ and $\sum_{1 \le n \le x: q|n} 1 = \frac{x}{q} + O(1)$. The total number of primes less than $x$ is $O(x/\log x)$ by the prime number theorem, thus

$\displaystyle \sum_{1 \le n \le x} \chi(n) \ge x \Big( 1 - 2 \sum_{n_p \le q \le x} \frac{1}{q} \Big) - O\Big( \frac{x}{\log x} \Big).$

Using the classical asymptotic $\sum_{q \le x} \frac{1}{q} = \log \log x + C + o(1)$ for some absolute constant $C$ (which basically follows from the prime number theorem, but also has an elementary proof), we conclude that

$\displaystyle \sum_{1 \le n \le x} \chi(n) \ge x \Big( 1 - 2 \log \frac{\log x}{\log n_p} \Big) - o(x).$

If $x \le n_p^{\sqrt{e}-\varepsilon}$ for some fixed $\varepsilon > 0$, then the expression in brackets is bounded away from zero for $p$ large; in particular, this is incompatible with (2) for $p$ large enough. As a consequence, we see that if we have a bound of the form (2), then we can conclude $n_p \ll_\varepsilon x^{\frac{1}{\sqrt{e}}+\varepsilon}$; in particular, from the Pólya-Vinogradov inequality one has

$\displaystyle n_p \ll_\varepsilon p^{\frac{1}{2\sqrt{e}}+\varepsilon}$

for all $\varepsilon > 0$, or equivalently that $n_p \le p^{\frac{1}{2\sqrt{e}}+o(1)}$. (By being a bit more careful, one can refine the $p^{o(1)}$ loss to a power of $\log p$.)
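The smooth-number density entering this argument can be checked numerically. The sketch below uses the same observation as the argument above: for $1 \le u \le 2$, a non-$x^{1/u}$-smooth number up to $x$ has exactly one prime factor exceeding $x^{1/u}$, so a single sum over primes suffices. The limiting density is $1 - \log u$ (the Dickman function in this range); at finite $x$ the count is somewhat higher, so we only assert a loose window:

```python
import math

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    s = bytearray([1]) * (n + 1)
    s[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if s[i]:
            s[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if s[i]]

def smooth_density(x, u):
    """Fraction of n <= x that are x^(1/u)-smooth, for 1 <= u <= 2.
    Each non-smooth n <= x has exactly one prime factor q > x^(1/u),
    so it suffices to subtract sum_q floor(x/q) over such primes q."""
    y = x ** (1.0 / u)
    nonsmooth = sum(x // q for q in primes_up_to(x) if q > y)
    return (x - nonsmooth) / x

d = smooth_density(10 ** 5, 1.5)
# limiting density is 1 - log(1.5) ~ 0.595; finite-x value is a bit higher
assert 0.5 < d < 0.75
print(d, 1 - math.log(1.5))
```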

Remark 2. The estimates on the Gauss-type sums $\hat \chi(\xi)$ are sharp; nevertheless, they fail to penetrate the square root barrier in the sense that no non-trivial estimates are provided below the scale $\sqrt{p}$. One can also see this barrier using the Poisson summation formula, which basically gives a formula that (very roughly) takes the form

$\displaystyle \sum_n \chi(n) \psi\Big(\frac{n}{N}\Big) \approx \frac{N}{\sqrt{p}} \sum_m \chi(m) \hat \psi\Big(\frac{m}{p/N}\Big)$

for any nice cutoff $\psi$ and any scale $N$, and is basically the limit of what one can say about character sums using Fourier analysis alone. In particular, we see that the Pólya-Vinogradov bound is basically the Poisson dual of the trivial bound. The scale $N = \sqrt{p}$ is the crossing point where Poisson summation does not achieve any non-trivial modification of the scale parameter.

** — 2. Average-case bounds — **

The Pólya-Vinogradov bound establishes a non-trivial estimate (1) for $x$ significantly larger than $\sqrt{p}$. We are interested in extending (1) to shorter intervals.

Before we address this issue for a fixed interval $[1,x]$, we first study the *average-case* bound on short character sums. Fix a short length $H$, and consider the shifted sum

$\displaystyle \sum_{a+1 \le n \le a+H} \chi(n) \ \ \ \ \ (4)$

where $a$ is a parameter. The analogue of (1) for such intervals would be

$\displaystyle \sum_{a+1 \le n \le a+H} \chi(n) = o(H). \ \ \ \ \ (5)$

For very small $H$ (e.g. $H = p^\delta$ for some small $\delta > 0$), we do not know how to establish (5) for *all* $a$; but we can at least establish (5) for *almost all* $a \in \{1, \ldots, p\}$, with only about $O(\sqrt{p})$ exceptions (here we see the square root barrier again!).

More precisely, we will establish the moment estimates

$\displaystyle \frac{1}{p} \sum_{a=1}^{p} \Big| \sum_{a+1 \le n \le a+H} \chi(n) \Big|^{2k} \ll_k H^k + \frac{H^{2k}}{\sqrt{p}} \ \ \ \ \ (6)$

for any positive even integer $2k$. If $H$ is not too tiny, say $H \ge p^\delta$ for some $\delta > 0$, then by applying (6) for a sufficiently large $k$ and using Chebyshev’s inequality (or Markov’s inequality), we see (for any given $\varepsilon > 0$) that one has the non-trivial bound

$\displaystyle \Big| \sum_{a+1 \le n \le a+H} \chi(n) \Big| \le \varepsilon H$

for all but at most $O_{\delta,\varepsilon}(\sqrt{p})$ values of $a$.

To see why (6) is true, let us just consider the easiest case $k = 1$. Squaring, we expand (6) as

$\displaystyle \frac{1}{p} \sum_{a=1}^{p} \sum_{1 \le h, h' \le H} \chi(a+h) \chi(a+h') \ll H + \frac{H^2}{\sqrt{p}}.$

We can write $|\sum_{a+1 \le n \le a+H} \chi(n)|^2$ as $\sum_{1 \le h, h' \le H} \chi(a+h) \chi(a+h')$ (note that $\chi$ is real-valued). Interchanging the summations, and using the periodicity of $\chi$, we can rewrite the left-hand side as

$\displaystyle \sum_{1 \le h, h' \le H} \frac{1}{p} \sum_{a \in {\bf F}_p} \chi(a+h) \chi(a+h')$

where we have abused notation and identified the finite field ${\bf F}_p$ with $\{1, \ldots, p\}$.

For $h = h'$, the inner average is $\frac{p-1}{p} = O(1)$. For $h - h'$ non-zero, we claim the bound

$\displaystyle \Big| \frac{1}{p} \sum_{a \in {\bf F}_p} \chi(a+h) \chi(a+h') \Big| \ll \frac{1}{\sqrt{p}} \ \ \ \ \ (7)$

which is consistent with (and is in fact slightly stronger than) what one would get if $\chi$ was a random sign pattern; assuming this bound gives (6) for $k = 1$ as required.
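In fact, for the Legendre symbol the correlation sum in (7) can be evaluated exactly: by multiplicativity $\chi(a+h)\chi(a+h') = \chi((a+h)(a+h'))$, and a short computation shows that the unnormalised sum equals exactly $-1$ for every non-zero shift. A quick numerical check (Python sketch, with `chi` again computed via Euler's criterion):

```python
p = 97  # a small test prime

def chi(n):
    """Legendre symbol (n|p) via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

# pair correlation: sum over a full period is exactly -1 for any shift h != 0,
# far below the sqrt(p) size that a random sign pattern would exhibit
for h in range(1, p):
    assert sum(chi(a) * chi(a + h) for a in range(p)) == -1
```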

The bound (7) can be established by quite elementary means (as it comes down to counting points on the hyperbola $\{ (a,b): b^2 = (a+h)(a+h') \}$, which can be done by transforming the hyperbola to be rectangular), but for larger values of $k$ we will need the more general estimate

$\displaystyle \Big| \sum_{a \in {\bf F}_p} \chi(P(a)) \Big| \ll_k \sqrt{p} \ \ \ \ \ (8)$

whenever $P$ is a polynomial over ${\bf F}_p$ of degree $k$ which is not a constant multiple of a perfect square; this can be easily seen to give (6) for general $k$.

An equivalent form of (8) is that the hyperelliptic curve

$\displaystyle \{ (a, b) \in {\bf F}_p \times {\bf F}_p: b^2 = P(a) \} \ \ \ \ \ (9)$

contains $p + O_k(\sqrt{p})$ points. This fact follows from a general theorem of Weil establishing the Riemann hypothesis for curves over function fields, but can also be deduced by a more elementary argument of Stepanov, using the polynomial method, which we now give here. (This arrangement of the argument is based on the exposition in the book by Iwaniec and Kowalski.)

By translating the $a$ variable we may assume that $P(0)$ is non-zero. The key lemma is the following. Assume $p$ large, and take $m$ to be an integer comparable to $\sqrt{p}$ (other values of this parameter are possible, but this is the optimal choice). All polynomials are understood to be over the field ${\bf F}_p$ (i.e. they lie in the polynomial ring ${\bf F}_p[x]$), although indeterminate variables need not lie in this field.

Lemma 2. There exists a non-zero polynomial $Q(x)$ of one indeterminate variable $x$ over ${\bf F}_p$ of degree at most $\frac{1}{2} m p + O_k(m \sqrt{p})$ which vanishes to order at least $m$ at every point $x = a$ for which $P(a)$ is an invertible quadratic residue.

Note from the factor theorem that $Q$ can vanish to order at least $m$ at at most $\deg(Q)/m = \frac{1}{2} p + O_k(\sqrt{p})$ points, and so we see that $P(a)$ is an invertible quadratic residue for at most $\frac{1}{2} p + O_k(\sqrt{p})$ values of $a$. Multiplying $P$ by a quadratic non-residue and running the same argument, we also see that $P(a)$ is an invertible quadratic non-residue for at most $\frac{1}{2} p + O_k(\sqrt{p})$ values of $a$, and (8) (or the asymptotic $p + O_k(\sqrt{p})$ for the number of points in (9)) follows.
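Before turning to the proof, it may help to see (8)-(9) numerically. For a cubic $P$, the curve (9) is elliptic and the deviation from $p$ is at most $2\sqrt{p}$ (Hasse's bound). An illustrative Python check (the particular cubic is an arbitrary choice of ours with non-zero discriminant, hence squarefree):

```python
import math

p = 1009  # a test prime

def chi(n):
    """Legendre symbol (n|p) via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

# P(a) = a^3 + 2a + 1; discriminant -4*2^3 - 27*1^2 = -59, non-zero mod p
S = sum(chi(a ** 3 + 2 * a + 1) for a in range(p))

# each fibre b^2 = P(a) has 1 + chi(P(a)) points, so the affine point
# count of the curve is exactly p + S
points = p + S
assert abs(S) <= 2 * math.sqrt(p)  # Hasse / Weil bound for this cubic
print(points, p, S)
```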

We now prove the lemma. The polynomial $Q$ will be chosen to be of the form

$\displaystyle Q(x) = P(x)^m \big( R(x, x^p) + P(x)^{(p-1)/2} S(x, x^p) \big)$

where $R, S$ are polynomials of degree at most $\frac{m^2}{2}$ in the first variable, and degree at most $\frac{m}{2} + C$ in the second variable, where $C$ is a large constant (depending on $k$) to be chosen later (these parameters have been optimised for the argument that follows). Since $P(x)^{(p-1)/2}$ has degree at most $k(p-1)/2$, such a $Q$ will have degree at most

$\displaystyle \frac{1}{2} m p + O_k(m \sqrt{p})$

as required. We claim (for a suitable choice of the above parameters) that:

- (a) The degrees are small enough that $Q$ is a non-zero polynomial whenever $R, S$ are non-zero polynomials; and
- (b) The degrees are large enough that there exists a non-trivial choice of $R$ and $S$ such that $Q$ vanishes to order at least $m$ at every $a$ for which $P(a)$ is an invertible quadratic residue.

Claims (a) and (b) together establish the lemma.

We first verify (a). We can cancel off the initial factor $P(x)^m$, so that we need to show that $R(x, x^p) + P(x)^{(p-1)/2} S(x, x^p)$ does not vanish when at least one of $R, S$ is not vanishing. We may assume that $R, S$ are not both divisible by $P$, since we could cancel out a common factor of $P$ otherwise.

Suppose for contradiction that the polynomial vanished, which implies that $R(x, y) = -P(x)^{(p-1)/2} S(x, y)$ modulo $y - x^p$. Squaring and multiplying by $P(x)$, we see that

$\displaystyle P(x) R(x, y)^2 = P(x)^p S(x, y)^2 \hbox{ modulo } y - x^p.$

But $P(x)^p = P(x^p)$ over ${\bf F}_p$, and $P(x^p) = P(y)$ modulo $y - x^p$, by Fermat’s little theorem. Observe that $P(x) R(x, y)^2$ and $P(y) S(x, y)^2$ both have degree less than $p$ in the $x$ variable, and so we can remove the modulus and conclude that $P(x) R(x, y)^2 = P(y) S(x, y)^2$ over ${\bf F}_p$. But this implies (by the fundamental theorem of arithmetic for ${\bf F}_p[x, y]$) that $P$ is a constant multiple of a square, a contradiction. (Recall that $P(0)$ is non-zero, and that $R$ and $S$ are not both divisible by $P$.)
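The Frobenius identity $P(x)^p = P(x^p)$ over ${\bf F}_p$ (a consequence of the freshman's-dream binomial identity together with Fermat's little theorem on the coefficients) can be checked directly with naive polynomial arithmetic; a small Python sketch, with polynomials stored as coefficient lists indexed by degree:

```python
p = 7  # a small test prime

def poly_mul(A, B, p):
    """Multiply two coefficient lists modulo p."""
    C = [0] * (len(A) + len(B) - 1)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            C[i + j] = (C[i + j] + a * b) % p
    return C

def poly_pow(A, e, p):
    """Naive e-th power of a coefficient list modulo p."""
    R = [1]
    for _ in range(e):
        R = poly_mul(R, A, p)
    return R

P = [3, 1, 0, 2]  # 3 + x + 2x^3 over F_7 (an arbitrary test polynomial)
lhs = poly_pow(P, p, p)  # P(x)^7 over F_7
rhs = [0] * (p * (len(P) - 1) + 1)
for i, c in enumerate(P):  # P(x^7): coefficient c moves to degree 7*i
    rhs[p * i] = c % p
assert lhs == rhs
```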

Now we prove (b). Let $a$ be such that $P(a)$ is an invertible quadratic residue, thus $P(a)^{(p-1)/2} = 1$ by Fermat’s little theorem. To get vanishing to order $m$, we need

$\displaystyle \frac{d^j}{dx^j} Q(a) = 0 \ \ \ \ \ (10)$

for all $0 \le j < m$. (Of course, we cannot define derivatives using limits and Newton quotients in this finite characteristic setting, but we can still define derivatives of polynomials formally, thus for instance $\frac{d}{dx} x^n = n x^{n-1}$, and enjoy all the usual rules of calculus, such as the product rule and chain rule.)

Over ${\bf F}_p$, the polynomial $x^p$ has derivative $p x^{p-1} = 0$. If we then compute the derivative in (10) using many applications of the product and chain rule, we see that the left-hand side of (10) can be expressed in the form

$\displaystyle P(a)^{m-j} \big( R_j(a, a^p) + P(a)^{(p-1)/2} S_j(a, a^p) \big)$

where $R_j, S_j$ are polynomials of slightly larger degrees than $R, S$, whose coefficients depend in some linear fashion on the coefficients of $R$ and $S$. (The exact nature of this linear relationship will depend on $P$, but this will not concern us.) Since we only need to evaluate this expression when $a^p = a$ and $P(a)^{(p-1)/2} = 1$ (by Fermat’s little theorem), we thus see that we can verify (10) provided that the polynomial

$\displaystyle R_j(x, x) + S_j(x, x)$
vanishes identically. This is a polynomial of degree at most

$\displaystyle \frac{m^2}{2} + O_k(m)$

and there are $m$ possible values of $j$, so this leads to

$\displaystyle \frac{m^3}{2} + O_k(m^2)$

linear constraints on the coefficients of $R$ and $S$ to satisfy. On the other hand, the total number of these coefficients is

$\displaystyle 2 \Big( \frac{m^2}{2} + 1 \Big) \Big( \frac{m}{2} + C + 1 \Big) = \frac{m^3}{2} + (C + O(1)) m^2.$

For $C$ large enough, there are more coefficients than constraints, and so one can find a non-trivial choice of coefficients obeying the constraints (10), and (b) follows.

Remark 3. If one optimises all the constants here, one gets an upper bound of basically $O_k(\sqrt{p})$ for the deviation in the number of points in (9). This is only a little worse than the sharp bound of $2g\sqrt{p}$ given from Weil’s theorem, where $g$ is the genus of the curve; however, it is possible to boost the former bound to the latter by using a version of the tensor power trick (generalising ${\bf F}_p$ to ${\bf F}_{p^n}$ and then letting $n \rightarrow \infty$) combined with the theory of Artin L-functions and the Riemann-Roch theorem. This is (very briefly!) sketched in the Tricki article on the tensor power trick.

Remark 4. Once again, the global estimate (8) is very sharp, but cannot penetrate below the square root barrier, in that one is allowed to have about $\sqrt{p}$ exceptional values of $a$ for which no cancellation exists. One expects that these exceptional values of $a$ in fact do not exist, but we do not know how to show this unless $H$ is larger than $p^{1/4}$ (so that the Burgess bounds apply).

** — 3. The Burgess bound — **

The average case bounds in the previous section give an alternate demonstration of a non-trivial estimate (1) for $x \ge p^{1/2+\varepsilon}$, which is just a bit weaker than what the Pólya-Vinogradov inequality gives. Indeed, if (1) failed for such an $x$, thus

$\displaystyle \Big| \sum_{1 \le n \le x} \chi(n) \Big| \gg x,$

then by taking a small $H$ (e.g. $H = p^\delta$) and covering $[1,x]$ by intervals of length $H$, we see (from a first moment method argument) that

$\displaystyle \Big| \sum_{a+1 \le n \le a+H} \chi(n) \Big| \gg H$

for a positive fraction of the $a$ in $[1,x]$. But this contradicts the results of the previous section.

Burgess observed that by exploiting the multiplicativity of $\chi$ one last time to amplify the above argument, one can extend the range for which (1) can be proved from $x \ge p^{1/2+\varepsilon}$ to also cover the range $p^{1/4+\varepsilon} \le x \le p^{1/2+\varepsilon}$. The idea is not to cover $[1,x]$ by intervals of length $H$, but rather by *arithmetic progressions* $\{ b+a, b+2a, \ldots, b+Ha \}$ of length $H$, where $b = O(x)$ and $a = O(x/H)$. Another application of the first moment method then shows that

$\displaystyle \Big| \sum_{t=1}^{H} \chi(b + ta) \Big| \gg H$

for a positive fraction of the $b$ in $[1,x]$ and $a$ in $[1,x/H]$ (i.e. $\gg x^2/H$ such pairs $(a,b)$).

For technical reasons, it will be inconvenient if $a$ and $b$ have too large of a common factor, so we pause to remove this possibility. Observe that for any $q \ge 1$, the number of pairs $(a,b)$ which have $q$ as a common factor is $O(\frac{x^2}{H q^2})$. As $\sum_q \frac{1}{q^2}$ is convergent, we may thus remove those pairs which have too large of a common factor, and assume that all pairs $(a,b)$ have common factor at most $O(1)$ (so $a, b$ are “almost coprime”).

Now we exploit the multiplicativity of $\chi$ to write $\chi(b + ta)$ as $\chi(a) \chi(t + s)$, where $s$ is a residue which is equal to $b a^{-1}$ mod $p$. Dividing out by $\chi(a)$, we conclude that

$\displaystyle \Big| \sum_{t=1}^{H} \chi(t + s) \Big| \gg H. \ \ \ \ \ (11)$

Now for a key observation: the values of $s$ arising from the pairs $(a,b)$ are mostly disjoint. Indeed, suppose that two pairs $(a,b), (a',b')$ generated the same value of $s$, thus $b a^{-1} = b' (a')^{-1}$ mod $p$. This implies that $b a' = b' a$ mod $p$. Since $b a'$ and $b' a$ do not exceed $x^2/H < p$, we may remove the modulus and conclude that $b a' = b' a$. But since we are assuming that $a, b$ and $a', b'$ are almost coprime, we see that for each $(a,b)$ there are at most $O(1)$ values of $(a',b')$ for which $b a' = b' a$. So the $s$‘s here only overlap with multiplicity $O(1)$, and we conclude that (11) holds for $\gg x^2/H$ values of $s$. But comparing this with the previous section, we obtain a contradiction unless $x^2/H \ll \sqrt{p}$. Setting $H$ to be a sufficiently small power of $p$, we obtain Burgess’s result that (1) holds for $x \ge p^{1/4+\varepsilon}$ for any fixed $\varepsilon > 0$.
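The dilation identity underlying this step, $\chi(b + ta) = \chi(a)\chi(t + ba^{-1})$, is immediate from multiplicativity and periodicity, and easy to verify numerically (Python sketch; the specific values of $a$, $b$ are arbitrary test choices):

```python
p = 101  # a small test prime

def chi(n):
    """Legendre symbol (n|p) via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

a, b = 7, 30
s = b * pow(a, p - 2, p) % p  # s = b * a^{-1} mod p, via Fermat's little theorem
for t in range(1, p):
    # chi(b + ta) = chi(a (t + a^{-1} b)) = chi(a) chi(t + s)
    assert chi(b + t * a) == chi(a) * chi(t + s)
```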

Combining Burgess’s estimate with Vinogradov’s sieving trick we conclude the bound $n_p \ll_\varepsilon p^{\frac{1}{4\sqrt{e}}+\varepsilon}$ for all $\varepsilon > 0$, which is the best bound known today on the least quadratic non-residue except for refinements of the $p^\varepsilon$ error term.

Remark 5. There are many generalisations of this result, for instance to more general characters $\chi$ (with possibly composite conductor), or to shifted sums (4). However, the $p^{1/4}$ type exponent has not been improved except with the assistance of powerful conjectures, such as the generalised Riemann hypothesis.
