In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.
The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

$$\frac{1}{x} \sum_{n \leq x} \lambda(n) = o(1) \quad (1)$$

of the prime number theorem (where $\lambda$ is the Liouville function, and we take the limit $x \to \infty$). One can interpret this estimate probabilistically as

$$\mathbb{E} \lambda(\mathbf{n}) = o(1) \quad (2)$$

where $\mathbf{n}$ is a random variable drawn uniformly from the natural numbers up to $x$, and $\mathbb{E}$ denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

$$\sum_{n \leq x} \lambda(n) \lambda(n+1).$$

After dividing by $x$, this is essentially

$$\mathbb{E} \lambda(\mathbf{n}) \lambda(\mathbf{n}+1).$$

With probabilistic intuition, one may expect the random variables $\lambda(\mathbf{n}), \lambda(\mathbf{n}+1)$ to be approximately independent (there is no obvious relationship between the number of prime factors of $\mathbf{n}$ and of $\mathbf{n}+1$), and so the above average would be expected to be approximately equal to

$$(\mathbb{E} \lambda(\mathbf{n}))\, (\mathbb{E} \lambda(\mathbf{n}+1)),$$

which by (2) is equal to $o(1)$. Thus we are led to the prediction

$$\frac{1}{x} \sum_{n \leq x} \lambda(n) \lambda(n+1) = o(1). \quad (3)$$
The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes); while there has been recent progress towards establishing it rigorously, it remains open for now.
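As a quick numerical illustration of (2) and (3) (this sketch is not part of the notes; the sieve, the cutoff $x = 10^6$, and the helper names are arbitrary choices), one can compute both averages for the Liouville function directly:

```python
def liouville_sieve(N):
    """Return a list lam with lam[n] = lambda(n) for 1 <= n <= N,
    computed via a smallest-prime-factor sieve."""
    spf = list(range(N + 1))          # spf[n] = smallest prime factor of n
    for p in range(2, int(N ** 0.5) + 1):
        if spf[p] == p:               # p is prime
            for m in range(p * p, N + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [1] * (N + 1)
    for n in range(2, N + 1):
        lam[n] = -lam[n // spf[n]]    # lambda is completely multiplicative, lambda(p) = -1
    return lam

x = 10 ** 6
lam = liouville_sieve(x + 1)
mean = sum(lam[1:x + 1]) / x                                   # empirical version of (2)
corr = sum(lam[n] * lam[n + 1] for n in range(1, x + 1)) / x   # empirical version of (3)
print(f"(1/x) sum lambda(n)            = {mean:+.5f}")
print(f"(1/x) sum lambda(n)lambda(n+1) = {corr:+.5f}")
# Both averages should be small, consistent with the o(1) predictions (2) and (3).
```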
How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “$L^2$” measure of independence known as covariance, and the “$L \log L$” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of $L^2$ type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of $L \log L$ methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as $L^2$ methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

$$\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n} = o(\log x)$$

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound $\sum_{n \leq x} \frac{\lambda(n)}{n} = o(\log x)$, but the latter bound is much easier to establish than the former).
As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables $\mathbf{X}$ and $\mathbf{Y}$ of number-theoretic origin (such as the random variables $\lambda(\mathbf{n})$ and $\lambda(\mathbf{n}+1)$ mentioned previously), it can often be extremely difficult to determine the extent to which $\mathbf{X}, \mathbf{Y}$ behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if $\mathbf{Y}_1, \dots, \mathbf{Y}_k$ are a large collection of “independent” random variables, and $\mathbf{X}$ is a further random variable that is “not too large” in some sense, then $\mathbf{X}$ must necessarily be nearly independent (or conditionally independent) to many of the $\mathbf{Y}_i$, even if one cannot pinpoint precisely which of the $\mathbf{Y}_i$ the variable $\mathbf{X}$ is independent with. In the case of the second moment method, this allows us to compute correlations such as $\mathbb{E} \mathbf{X} \mathbf{Y}_i$ for “most” $i$. The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities $\mathbb{E} F(\mathbf{X}, \mathbf{Y}_i)$ for “most” $i$, thanks to tools such as the Pinsker inequality.
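To make the two measures concrete, here is a small empirical computation (again not from the notes; the sample size is arbitrary, and the empirical mutual information of a finite sample carries a small positive bias) of the covariance and the mutual information of the pair $(\lambda(\mathbf{n}), \lambda(\mathbf{n}+1))$:

```python
import math
from collections import Counter
from sympy import factorint

N = 10 ** 5
lam = [0, 1] + [(-1) ** sum(factorint(n).values()) for n in range(2, N + 2)]
pairs = [(lam[n], lam[n + 1]) for n in range(1, N + 1)]

# "L^2" measure of independence: the covariance
mean_a = sum(a for a, _ in pairs) / N
mean_b = sum(b for _, b in pairs) / N
cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / N

# entropy-type measure of independence: the (empirical) mutual information
joint = Counter(pairs)
marg_a = Counter(a for a, _ in pairs)
marg_b = Counter(b for _, b in pairs)
mi = sum((c / N) * math.log((c / N) / ((marg_a[a] / N) * (marg_b[b] / N)))
         for (a, b), c in joint.items())

print(f"covariance         = {cov:+.5f}")
print(f"mutual information = {mi:.6f} nats")
# Approximate independence of lambda(n) and lambda(n+1) would make both quantities small.
```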
Joni Teräväinen and I have just uploaded to the arXiv our paper “The structure of correlations of multiplicative functions at almost all scales, with applications to the Chowla and Elliott conjectures“. This is a sequel to our previous paper that studied logarithmic correlations of the form

$$f(a) := \widetilde{\lim}_{x \to \infty} \frac{1}{\log x} \sum_{n \leq x} \frac{g_1(n + a h_1) \cdots g_k(n + a h_k)}{n},$$

where $g_1, \dots, g_k$ were bounded multiplicative functions, $h_1, \dots, h_k$ were fixed shifts, $x$ was a quantity going off to infinity, and $\widetilde{\lim}$ was a generalised limit functional. Our main technical result asserted that these correlations were necessarily the uniform limit of periodic functions $f_i$. Furthermore, if $g_1 \cdots g_k$ (weakly) pretended to be a Dirichlet character $\chi$, then the $f_i$ could be chosen to be $\chi$-isotypic in the sense that $f_i(ab) = \chi(b) f_i(a)$ whenever $a, b$ are integers with $b$ coprime to the periods of $\chi$ and $f_i$; otherwise, if $g_1 \cdots g_k$ did not weakly pretend to be any Dirichlet character $\chi$, then $f$ vanished completely. This was then used to verify several cases of the logarithmically averaged Elliott and Chowla conjectures.
The purpose of this paper was to investigate the extent to which the methods could be extended to non-logarithmically averaged settings. For our main technical result, we now considered the unweighted averages

$$f_x(a) := \frac{1}{x} \sum_{n \leq x} g_1(n + a h_1) \cdots g_k(n + a h_k),$$

where $x$ is an additional parameter. Our main result was now as follows. If $g_1 \cdots g_k$ did not weakly pretend to be a twisted Dirichlet character $n \mapsto \chi(n) n^{it}$, then $f_x(a)$ converged to zero on (doubly logarithmic) average as $x \to \infty$. If instead $g_1 \cdots g_k$ did pretend to be such a twisted Dirichlet character, then $f_x(a)$ converged on (doubly logarithmic) average to a limit $f$ of $\chi$-isotypic functions $f_i$. Thus, roughly speaking, one has the approximation

$$f_x(a) \approx x^{it} f(a)$$

for most $x$.
Informally, this says that at almost all scales $x$ (where “almost all” means “outside of a set of logarithmic density zero”), the non-logarithmic averages behave much like their logarithmic counterparts except for a possible additional twisting by an Archimedean character $n \mapsto n^{it}$ (which interacts with the Archimedean scale parameter $x$ in much the same way that the Dirichlet character $\chi$ interacts with the non-Archimedean parameter $a$). One consequence of this is that most of the recent results on the logarithmically averaged Chowla and Elliott conjectures can now be extended to their non-logarithmically averaged counterparts, so long as one excludes a set of exceptional scales $x$ of logarithmic density zero. For instance, the Chowla conjecture

$$\frac{1}{x} \sum_{n \leq x} \lambda(n + h_1) \cdots \lambda(n + h_k) = o(1)$$

is now established for $k$ either odd or equal to $2$, so long as one excludes an exceptional set of scales.
In the logarithmically averaged setup, the main idea was to combine two very different pieces of information on the correlations $f$. The first, coming from recent results in ergodic theory, was to show that $f$ was well approximated in some sense by a nilsequence. The second was to use the “entropy decrement argument” to obtain an approximate isotopy property of the form

$$f(ap) \approx g_1(p) \cdots g_k(p)\, f(a)$$

for “most” primes $p$ and integers $a$. Combining the two facts, one eventually finds that only the almost periodic components of the nilsequence are relevant.
In the current situation, each $f_x(a)$ is approximated by a nilsequence, but the nilsequence can vary with $x$ (although there is some useful “Lipschitz continuity” of this nilsequence with respect to the $\log x$ parameter). Meanwhile, the entropy decrement argument gives an approximation basically of the form

$$f_x(ap) \approx g_1(p) \cdots g_k(p)\, f_x(a)$$

for “most” $p$. The arguments then proceed largely as in the logarithmically averaged case. A key lemma to handle the dependence on the new parameter $x$ is the following cohomological statement: if one has a map $x \mapsto \phi(x)$ that was a quasimorphism in the sense that

$$\phi(xy) = \phi(x) + \phi(y) + O(\varepsilon)$$

for all $x, y$ and some small $\varepsilon$, then there exists a real number $t$ such that $\phi(x) = t \log x + O(\varepsilon)$ for all $x$. This is achieved by applying a standard “cocycle averaging argument” to the cocycle $(x, y) \mapsto \phi(xy) - \phi(x) - \phi(y)$.
It would of course be desirable to not have the set of exceptional scales. We only know of one (implausible) scenario in which we can do this, namely when one has far fewer (in particular, subexponentially many) sign patterns for (say) the Liouville function than predicted by the Chowla conjecture. In this scenario (roughly analogous to the “Siegel zero” scenario in multiplicative number theory), the entropy of the Liouville sign patterns is so small that the entropy decrement argument becomes powerful enough to control all scales rather than almost all scales. On the other hand, this scenario seems to be self-defeating, in that it allows one to establish a large number of cases of the Chowla conjecture, and the full Chowla conjecture is inconsistent with having unusually few sign patterns. Still it hints that future work in this direction may need to split into “low entropy” and “high entropy” cases, in analogy to how many arguments in multiplicative number theory have to split into the “Siegel zero” and “no Siegel zero” cases.
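To make the notion of “sign patterns” concrete, here is a small counting experiment (mine, not from the paper; the cutoff and pattern length are arbitrary): it counts how many of the $2^k$ possible length-$k$ sign patterns of the Liouville function actually occur up to a given height. The Chowla conjecture predicts that all $2^k$ patterns occur with the expected density, so a substantial shortfall would be the “low entropy” scenario discussed above.

```python
from sympy import factorint

def liouville(n):
    # Liouville function: (-1)^Omega(n), with Omega counting prime factors with multiplicity
    return (-1) ** sum(factorint(n).values())

N, k = 10 ** 5, 6
lam = [liouville(n) for n in range(1, N + k)]          # lam[i] = lambda(i + 1)
patterns = {tuple(lam[i:i + k]) for i in range(N)}     # observed length-k sign patterns
print(f"distinct length-{k} sign patterns up to N={N}: {len(patterns)} out of {2 ** k}")
```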
Kaisa Matomaki, Maksym Radziwill, and I have uploaded to the arXiv our paper “Correlations of the von Mangoldt and higher divisor functions II. Divisor correlations in short ranges“. This is a sequel of sorts to our previous paper on divisor correlations, though the proof techniques in this paper are rather different. As with the previous paper, our interest is in correlations such as

$$\sum_{n \leq X} d_k(n)\, d_l(n+h) \quad (1)$$

for medium-sized $h$ and large $X$, where $k, l$ are natural numbers and $d_k(n) = \sum_{n = m_1 \cdots m_k} 1$ is the $k^{\mathrm{th}}$ divisor function (actually our methods can also treat a generalisation in which $k$ is non-integer, but for simplicity let us stick with the integer case for this discussion). Our methods also allow for one of the divisor function factors to be replaced with a von Mangoldt function, but (in contrast to the previous paper) we cannot treat the case when both factors are von Mangoldt.
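For concreteness, here is a small computation (not from the paper; the parameters $X$, $h$, $k$, $l$ and the helper names are arbitrary) of the divisor functions $d_k$ by repeated Dirichlet convolution, together with the correlation sum (1):

```python
def dk_sieve(N, k):
    """Return a list with d_k(n) for 0 <= n <= N, where d_k(n) counts ordered
    factorisations n = m_1 * ... * m_k; computed by k-1 Dirichlet convolutions with 1."""
    d = [1] * (N + 1)                  # d_1(n) = 1
    d[0] = 0
    for _ in range(k - 1):
        new = [0] * (N + 1)
        for a in range(1, N + 1):
            for m in range(a, N + 1, a):
                new[m] += d[m // a]    # (1 * d)(m) = sum over divisors a of m of d(m / a)
        d = new
    return d

X, h, k, l = 30000, 1, 3, 3
dk = dk_sieve(X + h, k)
dl = dk if k == l else dk_sieve(X + h, l)
corr = sum(dk[n] * dl[n + h] for n in range(1, X + 1))
print(f"sum_(n<=X) d_{k}(n) d_{l}(n+h) = {corr}   (X={X}, h={h})")
```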
As discussed in this previous post, one heuristically expects an asymptotic of the form

$$\sum_{n \leq X} d_k(n)\, d_l(n+h) = P_{k,l,h}(\log X)\, X + O(X^{1/2+\varepsilon})$$

for any fixed $\varepsilon > 0$, where $P_{k,l,h}$ is a certain explicit (but rather complicated) polynomial of degree $k + l - 2$. Such asymptotics are known when $\min(k,l) \leq 2$, but remain open when $k, l \geq 3$.
In the previous paper, we were able to obtain a weaker bound of the form

$$\sum_{n \leq X} d_k(n)\, d_l(n+h) = P_{k,l,h}(\log X)\, X + O_A(X \log^{-A} X)$$

for all but $O_A(H \log^{-A} X)$ of the shifts $h$ with $|h| \leq H$, whenever the shift range $H$ lies between $X^{8/33+\varepsilon}$ and $X^{1-\varepsilon}$. But the methods become increasingly hard to use as $H$ gets smaller. In this paper, we use a rather different method to obtain the even weaker bound

$$\sum_{n \leq X} d_k(n)\, d_l(n+h) = P_{k,l,h}(\log X)\, X + O(X \log^{-c} X)$$

for all but $O(H \log^{-c} X)$ of the shifts $h$ with $|h| \leq H$, where $H$ can now be as short as $\log^{10000 k \log k} X$. The constant $10000$ can be improved, but there are serious obstacles to using our method to go below $\log^{k \log k} X$ (as the exceptionally large values of $d_k$ then begin to dominate). This can be viewed as an analogue to our previous paper on correlations of bounded multiplicative functions on average, in which the functions $d_k, d_l$ are now unbounded, and indeed our proof strategy is based in large part on that paper (but with many significant new technical complications).
We now discuss some of the ingredients of the proof. Unsurprisingly, the first step is the circle method, expressing (1) in terms of exponential sums such as

$$S(\alpha) := \sum_{n \leq X} d_k(n)\, e(\alpha n).$$

Actually, it is convenient to first prune $d_k$ slightly by zeroing out this function on “atypical” numbers $n$ that have an unusually small or large number of factors in a certain sense, but let us ignore this technicality for this discussion. The contribution of $S(\alpha)$ for “major arc” $\alpha$ can be treated by standard techniques (and is the source of the main term $P_{k,l,h}(\log X)\, X$); the main difficulty comes from treating the contribution of “minor arc” $\alpha$.
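To get a feel for the major/minor arc dichotomy, here is a toy computation (mine; it uses the ordinary divisor function $d = d_2$ and a small cutoff, and the choice of frequencies is arbitrary) comparing the exponential sum at a rational frequency with the sum at a badly approximable irrational frequency:

```python
import cmath, math

def divisor_sieve(N):
    d = [0] * (N + 1)
    for a in range(1, N + 1):
        for m in range(a, N + 1, a):
            d[m] += 1
    return d

X = 50000
d = divisor_sieve(X)

def S(alpha):
    # exponential sum sum_{n <= X} d(n) e(alpha n), where e(t) = exp(2 pi i t)
    return sum(d[n] * cmath.exp(2j * math.pi * alpha * n) for n in range(1, X + 1))

print(f"|S(1/3)|       = {abs(S(1 / 3)):12.1f}   (rational 'major arc' frequency)")
print(f"|S(sqrt(2)-1)| = {abs(S(math.sqrt(2) - 1)):12.1f}   (generic 'minor arc' frequency)")
print(f"trivial bound  = {sum(d[1:]):12d}")
```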
In our previous paper on bounded multiplicative functions, we used Plancherel’s theorem to estimate the global $L^2$ norm $\int_0^1 |S(\alpha)|^2\, d\alpha$, and then also used the Katai-Bourgain-Sarnak-Ziegler orthogonality criterion to control local $L^2$ norms $\int_I |S(\alpha)|^2\, d\alpha$, where $I$ was a minor arc interval of length about $1/H$, and these two estimates together were sufficient to get a good bound on correlations by an application of Hölder’s inequality. For $d_k$, it is more convenient to use Dirichlet series methods (and Ramaré-type factorisations of such Dirichlet series) to control local $L^2$ norms on minor arcs, in the spirit of the proof of the Matomaki-Radziwill theorem; a key point is to develop “log-free” mean value theorems for Dirichlet series associated to functions such as $d_k$, so as not to wipe out the (rather small) savings one will get over the trivial bound from this method. On the other hand, the global $L^2$ bound will definitely be unusable, because the $L^2$ sum $\sum_{n \leq X} d_k(n)^2$ has too many unwanted factors of $\log X$. Fortunately, we can substitute this global $L^2$ bound with a “large values” bound that controls expressions such as

$$\sum_{i=1}^J \int_{I_i} |S(\alpha)|^2\, d\alpha$$

for a moderate number of disjoint intervals $I_1, \dots, I_J$, with a bound that is slightly better (for $J$ a medium-sized power of $\log X$) than what one would have obtained by bounding each integral $\int_{I_i} |S(\alpha)|^2\, d\alpha$ separately. (One needs to save a certain amount over the trivial bound for the argument to work; we end up saving a factor of about a small power of $\log X$.) This large values estimate is probably the most novel contribution of the paper. After taking the Fourier transform, matters basically reduce to getting a good estimate for

$$\sum_{i=1}^J \Big| \sum_{n \leq X} d_k(n)\, e(\alpha_i n) \Big|^2,$$

where $\alpha_i$ is the midpoint of $I_i$; thus we need some upper bound on the large local Fourier coefficients of $d_k$. These coefficients are difficult to calculate directly, but, in the spirit of a paper of Ben Green and myself, we can try to replace $d_k$ by a more tractable and “pseudorandom” majorant $\tilde d_k$ for which the local Fourier coefficients are computable (on average). After a standard duality argument, one ends up having to control expressions such as

$$\sum_{n \leq X} \tilde d_k(n)\, e((\alpha_i - \alpha_j) n)$$

after various averaging in the $i, j$ parameters. These local Fourier coefficients of $\tilde d_k$ turn out to be small on average unless $\alpha_i - \alpha_j$ is “major arc”. One then is left with a mostly combinatorial problem of trying to bound how often this major arc scenario occurs. This is very close to a computation in the previously mentioned paper of Ben and myself; there is a technical wrinkle in that the $\alpha_i$ are not as well separated as they were in my paper with Ben, but it turns out that one can modify the arguments in that paper to still obtain a satisfactory estimate in this case (after first grouping nearby frequencies $\alpha_i$ together, and modifying the duality argument accordingly).
Let $d(n) := \sum_{d|n} 1$ be the divisor function. A classical application of the Dirichlet hyperbola method gives the asymptotic

$$\sum_{n \leq x} d(n) \sim x \log x,$$

where $X \sim Y$ denotes the estimate $X = (1+o(1)) Y$ as $x \to \infty$. Much better error estimates are possible here, but we will not focus on the lower order terms in this discussion. For somewhat idiosyncratic reasons I will interpret this estimate (and the other analytic number theory estimates discussed here) through the probabilistic lens. Namely, if $\mathbf{n}$ is a random number selected uniformly between $1$ and $x$, then the above estimate can be written as

$$\mathbb{E}\, d(\mathbf{n}) \sim \log x, \quad (1)$$

that is to say the random variable $d(\mathbf{n})$ has mean approximately $\log x$. (But, somewhat paradoxically, this is not the median or mode behaviour of this random variable, which instead concentrates near $\log^{\log 2} x$, basically thanks to the Hardy-Ramanujan theorem.)
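Here is a small numerical illustration of this mean/median discrepancy (not in the original post; the cutoff $x = 10^6$ is arbitrary):

```python
import math

def divisor_sieve(N):
    # d[n] = number of divisors of n, for 0 <= n <= N
    d = [0] * (N + 1)
    for a in range(1, N + 1):
        for m in range(a, N + 1, a):
            d[m] += 1
    return d

x = 10 ** 6
vals = sorted(divisor_sieve(x)[1:])
print(f"mean of d(n), n <= x:   {sum(vals) / x:.2f}   (compare log x = {math.log(x):.2f})")
print(f"median of d(n), n <= x: {vals[x // 2]}      (compare log^(log 2) x = {math.log(x) ** math.log(2):.2f})")
```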
Now we turn to the pair correlations $\sum_{n \leq x} d(n) d(n+h)$ for a fixed positive integer $h$. There is a classical computation of Ingham that shows that

$$\sum_{n \leq x} d(n) d(n+h) \sim \frac{6}{\pi^2} \sigma_{-1}(h)\, x \log^2 x, \quad (2)$$

where $\sigma_{-1}(h) := \sum_{d|h} \frac{1}{d}$. The error term in (2) has been refined by many subsequent authors, as has the uniformity of the estimates in the $h$ aspect, as these topics are related to other questions in analytic number theory, such as fourth moment estimates for the Riemann zeta function; but we will not consider these more subtle features of the estimate here. However, we will look at the next term in the asymptotic expansion for (2) below the fold.

Using our probabilistic lens, the estimate (2) can be written as

$$\mathbb{E}\, d(\mathbf{n}) d(\mathbf{n}+h) \sim \frac{6}{\pi^2} \sigma_{-1}(h) \log^2 x. \quad (3)$$

From (1) (and the asymptotic negligibility of the shift by $h$) we see that the random variables $d(\mathbf{n})$ and $d(\mathbf{n}+h)$ both have a mean of $\sim \log x$, so the additional factor of $\frac{6}{\pi^2} \sigma_{-1}(h)$ represents some arithmetic coupling between the two random variables.
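As a sanity check on (2) (a crude numerical experiment of mine, not part of the original discussion; the convergence is slow because of the lower order terms discussed later in the post), one can compare both sides for a few shifts $h$:

```python
import math

def divisor_sieve(N):
    d = [0] * (N + 1)
    for a in range(1, N + 1):
        for m in range(a, N + 1, a):
            d[m] += 1
    return d

def sigma_minus_one(h):
    return sum(1 / t for t in range(1, h + 1) if h % t == 0)

x = 10 ** 6
d = divisor_sieve(x + 10)
for h in (1, 2, 6):
    lhs = sum(d[n] * d[n + h] for n in range(1, x + 1))
    main = 6 / math.pi ** 2 * sigma_minus_one(h) * x * math.log(x) ** 2
    print(f"h = {h}: (sum d(n)d(n+h)) / ((6/pi^2) sigma_(-1)(h) x log^2 x) = {lhs / main:.3f}")
```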
Ingham’s formula can be established in a number of ways. Firstly, one can expand out $d(n) = \sum_{d|n} 1$ and use the hyperbola method (splitting into the cases $d \leq \sqrt{n}$ and $n/d \leq \sqrt{n}$ and removing the overlap). If one does so, one soon arrives at the task of having to estimate sums of the form

$$\sum_{n \leq x:\ d | n+h} d(n)$$

for various $d$. For $d$ much less than $\sqrt{x}$ this can be achieved using a further application of the hyperbola method, but for $d$ comparable to $\sqrt{x}$ things get a bit more complicated, necessitating the use of non-trivial estimates on Kloosterman sums in order to obtain satisfactory control on error terms. A more modern approach proceeds using automorphic form methods, as discussed in this previous post. A third approach, which unfortunately is only heuristic at the current level of technology, is to apply the Hardy-Littlewood circle method (discussed in this previous post) to express (2) in terms of exponential sums

$$\sum_{n \leq x} d(n)\, e(\alpha n)$$

for various frequencies $\alpha$. The contribution of “major arc” $\alpha$ can be computed after a moderately lengthy calculation which yields the right-hand side of (2) (as well as the correct lower order terms that are currently being suppressed), but there does not appear to be an easy way to show directly that the “minor arc” contributions are of lower order, although the methods discussed previously do indirectly show that this is ultimately the case.
Each of the methods outlined above requires a fair amount of calculation, and it is not obvious while performing them that the factor $\frac{6}{\pi^2} \sigma_{-1}(h)$ will emerge at the end. One can at least explain the $\frac{6}{\pi^2}$ as a normalisation constant needed to balance the $\sigma_{-1}(h)$ factor (at a heuristic level, at least). To see this through our probabilistic lens, introduce an independent copy $\mathbf{n}'$ of $\mathbf{n}$, then

$$\mathbb{E}\, d(\mathbf{n}) d(\mathbf{n}') = (\mathbb{E}\, d(\mathbf{n}))^2 \sim \log^2 x; \quad (4)$$

using symmetry to order $\mathbf{n} < \mathbf{n}'$ (discarding the diagonal case $\mathbf{n} = \mathbf{n}'$) and making the change of variables $\mathbf{n}' = \mathbf{n} + \mathbf{h}$, we see that (4) is heuristically consistent with (3) as long as the asymptotic mean of $\frac{6}{\pi^2} \sigma_{-1}(\mathbf{h})$ in $\mathbf{h}$ is equal to $1$. (This argument is not rigorous because there was an implicit interchange of limits present, but still gives a good heuristic “sanity check” of Ingham’s formula.) Indeed, if $\mathbb{E}_{\mathbf{h}} \sigma_{-1}(\mathbf{h})$ denotes the asymptotic mean of $\sigma_{-1}(\mathbf{h})$ in $\mathbf{h}$, then we have (heuristically at least)

$$\mathbb{E}_{\mathbf{h}} \sigma_{-1}(\mathbf{h}) = \sum_{d=1}^\infty \frac{1}{d} \cdot \frac{1}{d} = \frac{\pi^2}{6}$$

and we obtain the desired consistency after multiplying by $\frac{6}{\pi^2}$.
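The heuristic just used, that $\sigma_{-1}$ has asymptotic mean $\frac{\pi^2}{6}$, is easy to test numerically (a one-line check of mine; the cutoff is arbitrary):

```python
import math

H = 10 ** 6
# sum_{h <= H} sigma_{-1}(h) = sum_{t <= H} (1/t) * floor(H/t), grouping by the divisor t
total = sum((H // t) / t for t in range(1, H + 1))
print(f"average of sigma_(-1)(h) over h <= {H}: {total / H:.6f}")
print(f"pi^2 / 6                              = {math.pi ** 2 / 6:.6f}")
```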
This still however does not explain the presence of the $\sigma_{-1}(h)$ factor. Intuitively it is reasonable that if $h$ has many prime factors, and $\mathbf{n}$ has a lot of factors, then $\mathbf{n}+h$ will have slightly more factors than average, because any common factor to $\mathbf{n}$ and $h$ will automatically be acquired by $\mathbf{n}+h$. But how to quantify this effect?
One heuristic way to proceed is through analysis of local factors. Observe from the fundamental theorem of arithmetic that we can factor

$$d(n) = \prod_p d_p(n),$$

where the product is over all primes $p$, and $d_p(n)$ is the local version of $d(n)$ at $p$ (which in this case, is just one plus the $p$-valuation $v_p(n)$ of $n$: $d_p(n) = 1 + v_p(n)$). Note that all but finitely many of the terms in this product will equal $1$, so the infinite product is well-defined. In a similar fashion, we can factor

$$\sigma_{-1}(h) = \prod_p \sigma_{-1}(p^{v_p(h)}),$$

where $\sigma_{-1}(p^j) = 1 + \frac{1}{p} + \dots + \frac{1}{p^j}$ (or in terms of valuations, $\sigma_{-1}(p^{v_p(h)}) = \frac{1 - p^{-v_p(h)-1}}{1 - p^{-1}}$). Heuristically, the Chinese remainder theorem suggests that the various factors $d_p(\mathbf{n})$ behave like independent random variables, and so the correlation between $d(\mathbf{n})$ and $d(\mathbf{n}+h)$ should approximately decouple into the product of correlations between the local factors $d_p(\mathbf{n})$ and $d_p(\mathbf{n}+h)$. And indeed we do have the following local version of Ingham’s asymptotics:
Proposition 1 (Local Ingham asymptotics) For fixed $p$ and integer $h$, we have

$$\mathbb{E}\, d_p(\mathbf{n}) \sim \frac{p}{p-1}$$

and

$$\mathbb{E}\, d_p(\mathbf{n}) d_p(\mathbf{n}+h) \sim \left(1 - \frac{1}{p^2}\right) \sigma_{-1}(p^{v_p(h)}) \left(\frac{p}{p-1}\right)^2.$$
From the Euler formula

$$\prod_p \left(1 - \frac{1}{p^2}\right) = \frac{1}{\zeta(2)} = \frac{6}{\pi^2}$$

we see that

$$\prod_p \left(1 - \frac{1}{p^2}\right) \sigma_{-1}(p^{v_p(h)}) = \frac{6}{\pi^2} \sigma_{-1}(h),$$

and so one can “explain” the arithmetic factor $\frac{6}{\pi^2} \sigma_{-1}(h)$ in Ingham’s asymptotic as the product of the arithmetic factors $(1 - \frac{1}{p^2}) \sigma_{-1}(p^{v_p(h)})$ in the (much easier) local Ingham asymptotics. Unfortunately we have the usual “local-global” problem in that we do not know how to rigorously derive the global asymptotic from the local ones; this problem is essentially the same issue as the problem of controlling the minor arc contributions in the circle method, but phrased in “physical space” language rather than “frequency space”.
Remark 2 The relation between the local means $\mathbb{E}\, d_p(\mathbf{n})$ and the global mean $\mathbb{E}\, d(\mathbf{n})$ can also be seen heuristically through the application

$$\prod_{p \leq x^{1/e^\gamma}} \frac{p}{p-1} \sim \log x$$

of Mertens’ theorem, where $1/e^\gamma$ is Pólya’s magic exponent, which serves as a useful heuristic limiting threshold in situations where the product of local factors is divergent.
Let us now prove this proposition. One could brute-force the computations by observing that for any fixed $j \geq 0$, the valuation $v_p(\mathbf{n})$ is equal to $j$ with probability $\sim (1 - \frac{1}{p}) \frac{1}{p^j}$, and with a little more effort one can also compute the joint distribution of $v_p(\mathbf{n})$ and $v_p(\mathbf{n}+h)$, at which point the proposition reduces to the calculation of various variants of the geometric series. I however find it cleaner to proceed in a more recursive fashion (similar to how one can prove the geometric series formula by induction); this will also make visible the vague intuition mentioned previously about how common factors of $\mathbf{n}$ and $h$ force $\mathbf{n}+h$ to have a factor also.

It is first convenient to get rid of error terms by observing that in the limit $x \to \infty$, the random variable $\mathbf{n}$ converges vaguely to a uniform random variable on the profinite integers $\hat{\mathbb{Z}}$, or more precisely that the pair $(v_p(\mathbf{n}), v_p(\mathbf{n}+h))$ converges vaguely to the corresponding pair of valuations of a uniform profinite integer. Because of this (and because of the easily verified uniform integrability properties of the $d_p(\mathbf{n})$ and their powers), it suffices to establish the exact formulae

$$\mathbb{E}\, d_p(\mathbf{n}) = \frac{p}{p-1} \quad (5)$$

and

$$\mathbb{E}\, d_p(\mathbf{n}) d_p(\mathbf{n}+h) = \left(1 - \frac{1}{p^2}\right) \sigma_{-1}(p^{v_p(h)}) \left(\frac{p}{p-1}\right)^2 \quad (6)$$

in the profinite setting, with $\mathbf{n}$ now interpreted as a uniform random element of $\hat{\mathbb{Z}}$ (this setting will make it easier to set up the recursion).
We begin with (5). Observe that $\mathbf{n}$ is coprime to $p$ with probability $1 - \frac{1}{p}$, in which case $d_p(\mathbf{n})$ is equal to $1$. Conditioning to the complementary probability $\frac{1}{p}$ event that $\mathbf{n}$ is divisible by $p$, we can factor $\mathbf{n} = p \mathbf{m}$ where $\mathbf{m}$ is also uniformly distributed over the profinite integers, in which event we have $d_p(\mathbf{n}) = 1 + d_p(\mathbf{m})$. We arrive at the identity

$$\mathbb{E}\, d_p(\mathbf{n}) = \left(1 - \frac{1}{p}\right) + \frac{1}{p} \left(1 + \mathbb{E}\, d_p(\mathbf{m})\right).$$

As $\mathbf{n}$ and $\mathbf{m}$ have the same distribution, the quantities $\mathbb{E}\, d_p(\mathbf{n})$ and $\mathbb{E}\, d_p(\mathbf{m})$ are equal, and (5) follows by a brief amount of high-school algebra.
We use a similar method to treat (6). First treat the case when $h$ is coprime to $p$. Then we see that with probability $1 - \frac{2}{p}$, $\mathbf{n}$ and $\mathbf{n}+h$ are simultaneously coprime to $p$, in which case $d_p(\mathbf{n}) d_p(\mathbf{n}+h) = 1$. Furthermore, with probability $\frac{1}{p}$, $\mathbf{n}$ is divisible by $p$ and $\mathbf{n}+h$ is not; in which case we can write $\mathbf{n} = p\mathbf{m}$ as before, with $d_p(\mathbf{n}) = 1 + d_p(\mathbf{m})$ and $d_p(\mathbf{n}+h) = 1$. Finally, in the remaining event with probability $\frac{1}{p}$, $\mathbf{n}+h$ is divisible by $p$ and $\mathbf{n}$ is not; we can then write $\mathbf{n}+h = p\mathbf{m}$, so that $d_p(\mathbf{n}+h) = 1 + d_p(\mathbf{m})$ and $d_p(\mathbf{n}) = 1$. Putting all this together, we obtain

$$\mathbb{E}\, d_p(\mathbf{n}) d_p(\mathbf{n}+h) = \left(1 - \frac{2}{p}\right) + \frac{2}{p}\left(1 + \mathbb{E}\, d_p(\mathbf{m})\right)$$

and the claim (6) in this case follows from (5) and a brief computation (noting that $\sigma_{-1}(p^{v_p(h)}) = 1$ in this case).
Now suppose that $h$ is divisible by $p$, thus $h = p h'$ for some integer $h'$. Then with probability $1 - \frac{1}{p}$, $\mathbf{n}$ and $\mathbf{n}+h$ are simultaneously coprime to $p$, in which case $d_p(\mathbf{n}) d_p(\mathbf{n}+h) = 1$. In the remaining $\frac{1}{p}$ event, we can write $\mathbf{n} = p\mathbf{m}$, and then $\mathbf{n}+h = p(\mathbf{m}+h')$, so that $d_p(\mathbf{n}) = 1 + d_p(\mathbf{m})$ and $d_p(\mathbf{n}+h) = 1 + d_p(\mathbf{m}+h')$. Putting all this together we have

$$\mathbb{E}\, d_p(\mathbf{n}) d_p(\mathbf{n}+h) = \left(1 - \frac{1}{p}\right) + \frac{1}{p}\, \mathbb{E}\, \left(1 + d_p(\mathbf{m})\right)\left(1 + d_p(\mathbf{m}+h')\right),$$

which by (5) (and replacing $\mathbf{n}$ by $\mathbf{m}$) leads to the recursive relation

$$\mathbb{E}\, d_p(\mathbf{n}) d_p(\mathbf{n}+h) = \frac{p+1}{p-1} + \frac{1}{p}\, \mathbb{E}\, d_p(\mathbf{m}) d_p(\mathbf{m}+h'),$$

and (6) then follows by induction on the number of powers of $p$ dividing $h$.
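The exact formulae (5) and (6) are also easy to check numerically by approximating the uniform distribution on the profinite integers by the uniform distribution on the residues modulo a large power of $p$ (a small verification script of mine; the primes, shifts and truncation level are arbitrary):

```python
def vp(n, p):
    # p-adic valuation of a positive integer n
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def local_moments(p, h, K=12):
    # approximate a uniform profinite integer by a uniform n in {1, ..., p^K}
    M = p ** K
    E1 = sum(1 + vp(n, p) for n in range(1, M + 1)) / M
    E2 = sum((1 + vp(n, p)) * (1 + vp(n + h, p)) for n in range(1, M + 1)) / M
    return E1, E2

for p, h in [(2, 1), (2, 4), (3, 6)]:
    E1, E2 = local_moments(p, h)
    sigma = sum(p ** (-i) for i in range(vp(h, p) + 1))   # sigma_{-1}(p^{v_p(h)})
    pred1 = p / (p - 1)                                   # right-hand side of (5)
    pred2 = (1 - p ** -2) * sigma * pred1 ** 2            # right-hand side of (6)
    print(f"p={p}, h={h}:  E d_p(n) = {E1:.4f} vs {pred1:.4f},  "
          f"E d_p(n)d_p(n+h) = {E2:.4f} vs {pred2:.4f}")
```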
The estimate (2) of Ingham was refined by Estermann, who obtained the more accurate expansion

$$\sum_{n \leq x} d(n) d(n+h) = \frac{6}{\pi^2} \sigma_{-1}(h)\, x \log^2 x + a_1(h)\, x \log x + a_2(h)\, x + O(x^{11/12+o(1)})$$

for certain complicated but explicit coefficients $a_1(h), a_2(h)$. For instance, $a_1(h)$ is given by an explicit formula involving $\sigma_{-1}(h)$, the Euler-Mascheroni constant $\gamma$, and logarithmically weighted divisor sums such as $\sum_{d|h} \frac{\log d}{d}$. The formula for $a_2(h)$ is similar but even more complicated. The error term $O(x^{11/12+o(1)})$ was improved by Heath-Brown to $O(x^{5/6+o(1)})$; it is conjectured (for instance by Conrey and Gonek) that one in fact has square root cancellation $O(x^{1/2+o(1)})$ here, but this is well out of reach of current methods.
These lower order terms are traditionally computed either from a Dirichlet series approach (using Perron’s formula) or a circle method approach. It turns out that a refinement of the above heuristics can also predict these lower order terms, thus keeping the calculation purely in physical space as opposed to the “multiplicative frequency space” of the Dirichlet series approach, or the “additive frequency space” of the circle method, although the computations are arguably as messy as the latter computations for the purposes of working out the lower order terms. We illustrate this just for the $a_1(h)\, x \log x$ term below the fold.
Given two unit vectors $v, w$ in a real inner product space, one can define the correlation between these vectors to be their inner product $\langle v, w \rangle$, or in more geometric terms, the cosine of the angle $\theta$ subtended by $v$ and $w$. By the Cauchy-Schwarz inequality, this is a quantity between $-1$ and $1$, with the extreme positive correlation $+1$ occurring when $v, w$ are identical, the extreme negative correlation $-1$ occurring when $v, w$ are diametrically opposite, and the zero correlation $0$ occurring when $v, w$ are orthogonal. This notion is closely related to the notion of correlation between two non-constant square-integrable real-valued random variables $X, Y$, which is the same as the correlation between two unit vectors $v, w$ lying in the Hilbert space $L^2$ of square-integrable random variables, with $v$ being the normalisation of $X$ defined by subtracting off the mean $\mathbb{E} X$ and then dividing by the standard deviation of $X$, and similarly for $w$ and $Y$.
One can also define correlation for complex (Hermitian) inner product spaces by taking the real part of the complex inner product to recover a real inner product.
While reading the (highly recommended) recent popular maths book “How not to be wrong“, by my friend and co-author Jordan Ellenberg, I came across the (important) point that correlation is not necessarily transitive: if $X$ correlates with $Y$, and $Y$ correlates with $Z$, then this does not imply that $X$ correlates with $Z$. A simple geometric example is provided by the three unit vectors

$$u := (1, 0), \quad v := \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right), \quad w := (0, 1)$$

in the Euclidean plane $\mathbb{R}^2$: $u$ and $v$ have a positive correlation of $\frac{1}{\sqrt{2}}$, as do $v$ and $w$, but $u$ and $w$ are not correlated with each other. Or: for a typical undergraduate course, it is generally true that good exam scores are correlated with a deep understanding of the course material, and memorising from flash cards is correlated with good exam scores, but this does not imply that memorising flash cards is correlated with deep understanding of the course material.
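In coordinates, the example above is the following trivial check (here $u$, $v$, $w$ are the vectors just described):

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0]) / np.sqrt(2)
w = np.array([0.0, 1.0])

print("corr(u, v) =", round(float(u @ v), 4))   # 1/sqrt(2) ~ 0.7071
print("corr(v, w) =", round(float(v @ w), 4))   # 1/sqrt(2) ~ 0.7071
print("corr(u, w) =", round(float(u @ w), 4))   # 0: u and w are uncorrelated
```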
However, there are at least two situations in which some partial version of transitivity of correlation can be recovered. The first is in the “99%” regime in which the correlations are very close to $1$: if $u, v, w$ are unit vectors such that $u$ is very highly correlated with $v$, and $v$ is very highly correlated with $w$, then this does imply that $u$ is very highly correlated with $w$. Indeed, from the identity

$$\| u - v \| = \sqrt{2} \left(1 - \langle u, v \rangle\right)^{1/2}$$

(and similarly for $v, w$ and $u, w$) and the triangle inequality

$$\| u - w \| \leq \| u - v \| + \| v - w \|,$$

one has

$$1 - \langle u, w \rangle \leq 2 \left( (1 - \langle u, v \rangle) + (1 - \langle v, w \rangle) \right). \quad (1)$$

Thus, for instance, if $\langle u, v \rangle \geq 1 - \varepsilon$ and $\langle v, w \rangle \geq 1 - \varepsilon$, then $\langle u, w \rangle \geq 1 - 4\varepsilon$. This is of course closely related to (though slightly weaker than) the triangle inequality for angles:

$$\angle(u, w) \leq \angle(u, v) + \angle(v, w).$$
Remark 1 (Thanks to Andrew Granville for conversations leading to this observation.) The inequality (1) also holds for sub-unit vectors, i.e. vectors $u, v, w$ with $\|u\|, \|v\|, \|w\| \leq 1$. This comes by extending $u, v, w$ in directions orthogonal to all three original vectors and to each other in order to make them unit vectors, enlarging the ambient Hilbert space $H$ if necessary. More concretely, one can apply (1) to the unit vectors

$$\left(u, \sqrt{1 - \|u\|^2}, 0, 0\right), \quad \left(v, 0, \sqrt{1 - \|v\|^2}, 0\right), \quad \left(w, 0, 0, \sqrt{1 - \|w\|^2}\right)$$

in $H \times \mathbb{R}^3$.
But even in the “1%” regime in which correlations are very weak, there is still a version of transitivity of correlation, known as the van der Corput lemma, which basically asserts that if a unit vector $v$ is correlated with many unit vectors $u_1, \dots, u_n$, then many of the pairs $u_i, u_j$ will then be correlated with each other. Indeed, from the Cauchy-Schwarz inequality we have

$$\left| \sum_{i=1}^n \langle v, u_i \rangle \right|^2 \leq \left\| \sum_{i=1}^n u_i \right\|^2 = \sum_{i=1}^n \sum_{j=1}^n \langle u_i, u_j \rangle. \quad (2)$$

Thus, for instance, if $\langle v, u_i \rangle \geq \varepsilon$ for at least $\varepsilon n$ values of $i$, then (after removing those indices $i$ for which $\langle v, u_i \rangle < 0$) $\sum_{i,j} \langle u_i, u_j \rangle$ must be at least $\varepsilon^4 n^2$, which implies that $\langle u_i, u_j \rangle \geq \varepsilon^4/2$ for at least $\varepsilon^4 n^2 / 2$ pairs $(i, j)$. Or as another example: if a random variable $X$ exhibits at least $1\%$ positive correlation with $n$ other random variables $Y_1, \dots, Y_n$, then if $n > 10^4$, at least two distinct $Y_i, Y_j$ must have positive correlation with each other (although this argument does not tell you which pair $Y_i, Y_j$ are so correlated). Thus one can view this inequality as a sort of “pigeonhole principle” for correlation.
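Here is a small random experiment (mine, with arbitrary choices of dimension, $n$ and $\varepsilon$) illustrating this pigeonhole phenomenon: each $u_i$ is built to have correlation exactly $\varepsilon$ with a fixed unit vector $v$, and the average pairwise correlation of the $u_i$ is then forced to be positive, even though individual pairs can be negatively correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, eps = 200, 500, 0.1

v = np.zeros(dim)
v[0] = 1.0

U = []
for _ in range(n):
    g = rng.standard_normal(dim)
    g -= (g @ v) * v                  # project onto the orthogonal complement of v
    g /= np.linalg.norm(g)
    U.append(eps * v + np.sqrt(1 - eps ** 2) * g)   # unit vector with <v, u_i> = eps
U = np.array(U)

G = U @ U.T                           # Gram matrix of pairwise correlations
off_diag = G[~np.eye(n, dtype=bool)]
# Inequality (2) forces the average pairwise correlation to be at least
# (n * eps^2 - 1) / (n - 1), which is about 0.008 here.
print("average pairwise correlation:", round(float(off_diag.mean()), 4))
print("minimum pairwise correlation:", round(float(off_diag.min()), 4))
```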
A similar argument (multiplying each $u_i$ by an appropriate sign $\pm 1$) shows the related van der Corput inequality

$$\left( \sum_{i=1}^n |\langle v, u_i \rangle| \right)^2 \leq \sum_{i=1}^n \sum_{j=1}^n |\langle u_i, u_j \rangle|, \quad (3)$$

and this inequality is also true for complex inner product spaces. (Also, the $u_i$ do not need to be unit vectors for this inequality to hold.)
Geometrically, the picture is this: if $v$ positively correlates with all of the $u_1, \dots, u_n$, then the $u_i$ are all squashed into a somewhat narrow cone centred at $v$. The cone is still wide enough to allow a few pairs $u_i, u_j$ to be orthogonal (or even negatively correlated) with each other, but (when $n$ is large enough) it is not wide enough to allow all of the $u_i$ to be so widely separated. Remarkably, the bound here does not depend on the dimension of the ambient inner product space; while increasing the number of dimensions should in principle add more “room” to the cone, this effect is counteracted by the fact that in high dimensions, almost all pairs of vectors are close to orthogonal, and the exceptional pairs that are even weakly correlated to each other become exponentially rare. (See this previous blog post for some related discussion; in particular, Lemma 2 from that post is closely related to the van der Corput inequality presented here.)
A particularly common special case of the van der Corput inequality arises when $v$ is a unit vector fixed by some unitary operator $T$, and the $u_i$ are shifts $u_i = T^i u$ of a single unit vector $u$. In this case, the inner products $\langle v, u_i \rangle = \langle v, u \rangle$ are all equal, and we arrive at the useful van der Corput inequality

$$|\langle v, u \rangle|^2 \leq \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n |\langle T^i u, T^j u \rangle| = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n |\langle u, T^{j-i} u \rangle|.$$

(In fact, one can even remove the absolute values from the right-hand side, by using (2) instead of (3).) Thus, to show that $v$ has negligible correlation with $u$, it suffices to show that the shifts of $u$ have negligible correlation with each other.
Here is a basic application of the van der Corput inequality:
Proposition 2 (Weyl equidistribution estimate) Let $P: \mathbb{N} \to \mathbb{R}/\mathbb{Z}$ be a polynomial with at least one non-constant coefficient irrational. Then one has

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N e(P(n)) = 0,$$

where $e(\theta) := e^{2\pi i \theta}$.
Note that this assertion implies the more general assertion

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N e(k P(n)) = 0$$

for any non-zero integer $k$ (simply by replacing $P$ by $kP$), which by the Weyl equidistribution criterion is equivalent to the sequence $P(1), P(2), P(3), \dots$ being asymptotically equidistributed in $\mathbb{R}/\mathbb{Z}$.
Proof: We induct on the degree $d$ of the polynomial $P$, which must be at least one. If $d$ is equal to one, the claim is easily established from the geometric series formula, so suppose that $d \geq 2$ and that the claim has already been proven for $d-1$. If the top coefficient $\alpha_d$ of $P$ is rational, say $\alpha_d = \frac{a}{q}$, then by partitioning the natural numbers into residue classes modulo $q$, we see that the claim follows from the induction hypothesis; so we may assume that the top coefficient $\alpha_d$ is irrational.
In order to use the van der Corput inequality as stated above (i.e. in the formalism of inner product spaces) we will need a non-principal ultrafilter $p \in \beta\mathbb{N} \setminus \mathbb{N}$ (see e.g. this previous blog post for basic theory of ultrafilters); we leave it as an exercise to the reader to figure out how to present the argument below without the use of ultrafilters (or similar devices, such as Banach limits). The ultrafilter $p$ defines an inner product $\langle \cdot, \cdot \rangle_p$ on bounded complex sequences $(x_n)_{n=1}^\infty$ by setting

$$\langle (x_n), (y_n) \rangle_p := \lim_{N \to p} \frac{1}{N} \sum_{n=1}^N x_n \overline{y_n}.$$

Strictly speaking, this inner product is only positive semi-definite rather than positive definite, but one can quotient out by the null vectors to obtain a positive-definite inner product. To establish the claim, it will suffice to show that

$$\lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n)) = 0$$

for every non-principal ultrafilter $p$.
Note that the space of bounded sequences (modulo null vectors) admits a shift $T$, defined by

$$T (x_n)_{n=1}^\infty := (x_{n+1})_{n=1}^\infty.$$

This shift becomes unitary once we quotient out by null vectors, and the constant sequence $1$ is clearly a unit vector that is invariant with respect to the shift. So by the van der Corput inequality, we have

$$\left| \lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n)) \right|^2 \leq \frac{1}{H^2} \sum_{h=1}^H \sum_{h'=1}^H \left| \langle T^h (e(P(n)))_{n=1}^\infty, T^{h'} (e(P(n)))_{n=1}^\infty \rangle_p \right|$$

for any $H$. But we may rewrite

$$\langle T^h (e(P(n)))_{n=1}^\infty, T^{h'} (e(P(n)))_{n=1}^\infty \rangle_p = \lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n+h) - P(n+h')).$$

Then observe that if $h \neq h'$, $P(n+h) - P(n+h')$ is a polynomial of degree $d-1$ whose top coefficient $d \alpha_d (h - h')$ is irrational, so by induction hypothesis we have

$$\lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n+h) - P(n+h')) = 0$$

for $h \neq h'$. For $h = h'$ we of course have

$$\lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n+h) - P(n+h')) = 1,$$

and so

$$\left| \lim_{N \to p} \frac{1}{N} \sum_{n=1}^N e(P(n)) \right|^2 \leq \frac{1}{H}$$

for any $H$. Letting $H \to \infty$, we obtain the claim. $\Box$