The U.S. presidential election is now only a few weeks away.  The politics of this election are of course interesting and important, but I do not want to discuss these topics here (there is not exactly a shortage of other venues for such a discussion), and would request that readers refrain from doing so in the comments to this post.  However, I thought it would be apropos to talk about some of the basic mathematics underlying electoral polling, and specifically to explain the fact, which can be highly unintuitive to those not well versed in statistics, that polls can be accurate even when sampling only a tiny fraction of the entire population.

Take for instance a nationwide poll of U.S. voters on which presidential candidate they intend to vote for.  A typical poll will ask a number $n$ of randomly selected voters for their opinion; a typical value here is $n = 1000$.  In contrast, the total voting-eligible population of the U.S. – let’s call this set $X$ – is about 200 million.  (The actual turnout in the election is likely to be closer to 100 million, but let’s ignore this fact for the sake of discussion.)  Thus, such a poll would sample about 0.0005% of the total population $X$ – an incredibly tiny fraction.  Nevertheless, the margin of error (at the 95% confidence level) for such a poll, if conducted under idealised conditions (see below), is about 3%.  In other words, if we let $p$ denote the proportion of the entire population $X$ that will vote for a given candidate $A$, and let $\overline{p}$ denote the proportion of the polled voters that will vote for $A$, then the event $\overline{p}-0.03 \leq p \leq \overline{p}+0.03$ will occur with probability at least 0.95.  Thus, for instance (and oversimplifying a little – see below), if the poll reports that 55% of respondents would vote for A, then the true percentage of the electorate that would vote for A has at least a 95% chance of lying between 52% and 58%.  Larger polls will of course give a smaller margin of error; for instance the margin of error for an (idealised) poll of 2,000 voters is about 2%.
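As a rough numerical check of these figures, here is a minimal Python sketch based on the standard normal approximation to the binomial distribution. This is an assumption of the sketch; the appendix takes a different, cruder route. The function name `margin_of_error` and the worst-case choice p = 1/2 are purely illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def margin_of_error(n, p=0.5, z=Z_95):
    """Approximate margin of error for an idealised poll of n voters.

    Uses the normal approximation to the binomial distribution;
    p = 0.5 is the worst case.  The population size does not appear.
    """
    return z * math.sqrt(p * (1 - p) / n)


population = 200_000_000
for n in (1_000, 2_000):
    fraction = n / population
    print(f"n = {n}: sampled fraction = {fraction:.6%}, "
          f"margin of error = {margin_of_error(n):.1%}")
```

Under this approximation the margin comes out at roughly 3.1% for 1,000 voters and 2.2% for 2,000 voters. The population size enters only through the sampled fraction, not through the margin itself.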

I’ll give a rigorous proof of a weaker version of the above statement (giving a margin of error of about 7%, rather than 3%) in an appendix at the end of this post.  But the main point of my post here is a little different, namely to address the common misconception that the accuracy of a poll is a function of the relative sample size rather than the absolute sample size, which would suggest that a poll involving only 0.0005% of the population could not possibly have a margin of error as low as 3%.  I also want to point out some limitations of the mathematical analysis; depending on the methodology and the context, some polls involving 1000 respondents may have a much higher margin of error than the idealised rate of 3%.

– Assumptions and conclusion –

Not all polls are created equal; there are a certain number of hypotheses on the methodology and effectiveness of the poll that we have to assume in order to make our mathematical conclusions valid.  We will make the following idealised assumptions:

1. Simple question. Voters polled can only offer one of two responses, which I will call A and not-A; thus we ignore the effect of third-party candidates, undecided voters, or refusals to respond.  In particular, we do not try to combine this data with other questions about the polled voters, such as demographic data.  We also assume that the question is unambiguous and cannot be misinterpreted by respondents (see Hypothesis 3 below).
2. Perfect response rate. All voters polled offer a response; there are no refusals to respond to the poll, or failures to make contact with the voter being polled.  (This is a special case of 1., but deserves to be emphasised.)  In particular, this excludes polls that are self-selected, such as internet polls (since in most cases, a large fraction of viewers of a web page with a poll will refuse to respond to that poll).
3. Honest responses.  The response given by a voter to the poll is an accurate representation of whether that voter intends to vote for $A$ or not; thus we ignore response-distorting effects such as the Bradley effect or push-polling, as well as tactical voting, frivolous responses, misunderstanding of the question, or attempts to “game” a poll by the respondents.
4. Fixed poll size. The number $n$ of polled voters is fixed in advance; in particular, one cannot keep polling until one has achieved some desired outcome, and then stop.
5. Simple random sampling (without replacement).  Each one of the $n$ voters polled is selected uniformly at random among the entire population $X$, thus each voter is equally likely to be selected by the poll, and no non-voter can be selected by the poll.  (In particular, we make the important assumption that there is no selection bias.)  Furthermore, each polled voter is chosen independently of all the others, except for the one condition that we do not poll any given voter more than once.  (Thus, once a voter is polled, that voter is “crossed off the list” of the pool $X$ of voters that one randomly selects from to determine the next voter polled.)   In particular, we assume that the poll is not clustered.
6. Honest reporting. The results of the poll are always reported, with no inaccuracies; one cannot cancel, modify, or ignore a poll once it has begun.  In particular, one cannot conduct multiple polls and only report the “best” results (thus running the risk of confirmation bias).

Polls which deviate significantly from these hypotheses (e.g. due to complex questions, self-selection or other selection bias, confirmation bias, inaccurate responses, a high refusal rate, variable poll size, or clustering) will generally be less accurate than an idealised poll with the same sample size.  Of course, there is a substantial literature in statistics (and polling methodology) devoted to measuring, mitigating, avoiding, or compensating for these less ideal situations, but we will not discuss those (important) issues here.  We will remark though that in practice it is difficult to make the poll selection truly uniform.  For instance, if one is conducting a telephone poll, then the sample will of course be heavily biased towards those voters who actually own phones; a little more subtly, it will also be biased toward those voters who are near their phones at the time the poll was conducted, and have the time and inclination to answer phone calls.  As long as these factors are not strongly correlated with the poll question (i.e. whether the voter will vote for A), this is not a major concern, but in some cases, the poll methodology will need to be adjusted (e.g. by reweighting the sample) to compensate for the non-uniformity.

As stated in the introduction, we let $p$ be the proportion of the entire population $X$ that will vote for $A$, and $\overline{p}$ be the proportion of the polled voters that will vote for $A$ (which, by Hypotheses 2 and 3, is exactly equal to the proportion of polled voters that say that they will vote for $A$).  Under the above idealised conditions, if the number $n$ of polled voters is 1,000, and the size of the population $X$ is 200 million, then the margin of error is about 3%, thus ${\Bbb P}( \overline{p}-0.03 \leq p \leq \overline{p} + 0.03 ) \geq 0.95$.  (See this margin of error calculator for what happens with different choices of parameters.)
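One way to make this claim concrete is to simulate the idealised poll many times and count how often $p$ lands within 3% of $\overline{p}$. The sketch below is only an illustration under the hypotheses above. The value p = 0.55, the labelling of voters by integers, the number of trials, and the name `simulate_coverage` are all assumptions made for the example.

```python
import random


def simulate_coverage(p=0.55, n=1_000, population=200_000_000,
                      margin=0.03, trials=1_000, seed=0):
    """Estimate how often |p_bar - p| <= margin for an idealised poll."""
    rng = random.Random(seed)
    # Hypothetical labelling: voters 0 .. a_voters-1 are the A-voters.
    a_voters = round(p * population)
    true_p = a_voters / population
    hits = 0
    for _ in range(trials):
        # Simple random sampling without replacement (Hypothesis 5).
        sample = rng.sample(range(population), n)
        p_bar = sum(v < a_voters for v in sample) / n
        # Small tolerance so that floating-point rounding does not
        # misclassify outcomes lying exactly on the boundary.
        if abs(p_bar - true_p) <= margin + 1e-9:
            hits += 1
    return hits / trials


coverage = simulate_coverage()
print(f"fraction of simulated polls within 3% of p: {coverage:.3f}")
```

The observed fraction should come out close to 95%. Since the margin of error is only "about" 3%, it may sit slightly above or below that level in any particular run.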

There is an important subtlety here: it is only the unconditional probability of the event $\overline{p}-0.03 \leq p \leq \overline{p} + 0.03$ that is guaranteed to be at least 0.95.  If one has additional prior information about $p$ and $\overline{p}$, then the conditional probability of this event, relative to this information, may be very different.  For instance, if one had, prior to the poll, a very good reason to believe that $p$ is almost certainly between 0.4 and 0.6, and then the poll reports $\overline{p}$ to be 0.1, then the conditional probability that $\overline{p}-0.03 \leq p \leq \overline{p}+0.03$ occurs should be lower than the unconditional probability.  [Note though that having a priori information just about $p$, and not $\overline{p}$, will not cause the probability to drop below 95%, as this bound on the confidence level is uniform in $p$.] The question of how to account for prior information is a very delicate one in Bayesian probability, and will not be discussed here.

One special case of the above point  is worth emphasising: the statement that $\overline{p}-0.03 \leq p \leq \overline{p} + 0.03$ is true with at least 95% probability is only valid before one actually conducts the poll and finds out the value of $\overline{p}$.  Once $\overline{p}$ is computed, the statement $\overline{p}-0.03 \leq p \leq \overline{p} + 0.03$ is either true or false, i.e. occurs with probability 1 or 0 (unless one takes a Bayesian approach, as mentioned above).  [This phenomenon of course occurs all the time in probability.  For instance, if x denotes the outcome of rolling a fair six-sided die, then before one performs this roll, the probability that x equals 1 will be 1/6, but after one has seen what the value of this die is, the probability that x equals 1 will be either 1 or 0.]

– Nobody asked for my opinion! –

One intuitive argument against a poll of small relative size being accurate goes something like this: a poll of just 1,000 people among a population of 200,000,000 is almost certainly not going to poll me, or any of my friends or acquaintances.  If my opinions, and those of everyone that I know, are not being considered at all in this poll, how could this poll possibly be accurate?

It is true that if you know, say, 5,000 voting-eligible people, then chances are that none of them (or maybe one of them, at best) will be contacted by the above poll.  However, even though the opinions of all these people are not being directly polled, there will be many other people with equivalent opinions that will be contacted by the poll.  Through those people, the views of yourself and your friends are being represented.  [This may seem like a very weak form of representation, but recall that you and your 5,000 friends and acquaintances still only represent 0.0025% of the total electorate.]
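The claim that one's acquaintances are almost certainly missed can be checked directly. The sketch below is a minimal illustration, assuming simple random sampling without replacement; the helper name `prob_none_polled` is hypothetical.

```python
import math


def prob_none_polled(acquaintances, n, population):
    """Probability that a simple random sample of n voters (without
    replacement) contains none of a fixed group of acquaintances."""
    log_prob = 0.0
    for i in range(n):
        # Draw i+1 must avoid the group among the population - i
        # voters who have not yet been polled.
        log_prob += math.log1p(-acquaintances / (population - i))
    return math.exp(log_prob)


n, population, acquaintances = 1_000, 200_000_000, 5_000
expected = n * acquaintances / population
p_none = prob_none_polled(acquaintances, n, population)
print(f"expected number of acquaintances polled: {expected:.3f}")
print(f"probability that none are polled: {p_none:.3f}")
```

With these numbers the expected number of acquaintances polled is 0.025, and the chance that none of them is polled is about 97.5%.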

Now one may argue that no two voters are identical, and that each voter arrives at a decision of who to vote for for their own unique reasons.  True enough – but recall that this poll is asking only a simple question: whether one is going to vote for A or not.  Once one narrowly focuses on this question alone, any two voters who both decide to vote for A, or to not vote for A, are considered equivalent, even if they arrive at this decision for totally different reasons.  So, for the purposes of this poll, there are only two types of voters in the world – A-voters, and not-A-voters – with all voters in one of these two types considered equivalent.   In particular, any given voter is going to have millions of other equivalent voters distributed throughout the population $X$, and a representative fraction of those equivalent voters is likely to be picked up by the poll.

As mentioned before, polls which offer complex questions (for instance, trying to discern the motivation behind one’s voting choices) will inherently be less accurate; there are now fewer equivalent voters for each individual, and it is harder for a poll to pick up each equivalence class in a representative manner.  (In particular, the more questions that are asked, the more likely it becomes that the responses to at least one of these questions will be inaccurate by an amount exceeding its margin of error.  This provides a limit as to how much information one can confidently extract from data mining any given data set.)

– Is there enough information? –

Another common objection to the accuracy of polls argues that there is not enough information (or “degrees of freedom”) present in the poll sample to accurately describe the much larger amount of data present in the full population; 1,000 bits of data cannot possibly contain 200,000,000 bits of information.  However, we are not asking to find out so much information; the purpose of the poll is to estimate just a single piece of information, namely the number $p$.  If one is willing to accept an error of up to 3%, then one can represent this piece of information in about five bits rather than 200,000,000.  So, in principle at least, there is more than enough information present in the poll to recover this information; one does not need to sample the entire population to get a good reading.  (The same general philosophy underlies compressed sensing, but that’s another story.)

As before, the accuracy degrades as one asks more and more complicated questions.  For instance, if one were to poll 1,000 voters for their opinions on two unrelated questions A and B, each of the answers to A and B would be accurate to within 3% with probability 95%, but the probability that the answers to A and B were simultaneously accurate to within 3% would be lower (around 90% or so), and so any data analysis that relies on the responses to both A and B may not have as high a confidence level as data analysis that relies on A and B separately.  This is consistent with the information-theoretic perspective: we are demanding more and more bits of information on our population, and it is harder for our fixed data set to supply so much information accurately and confidently.
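The arithmetic behind the "around 90%" figure is just a product of probabilities. The sketch below assumes, as a simplification, that the errors on the different questions are independent and that each answer is within its margin with probability exactly 95%; the name `joint_confidence` is illustrative.

```python
def joint_confidence(per_question=0.95, questions=2):
    """Chance that all answers fall within their margins of error,
    assuming (for this sketch) that the errors are independent."""
    return per_question ** questions


for m in (1, 2, 5, 10):
    print(f"{m} question(s): {joint_confidence(questions=m):.1%}")
```

For two questions this gives about 90%, and by ten questions the joint confidence has dropped to roughly 60%.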

– Swings –

One intuitive way to gauge the margin of error of a poll is to see how likely such a poll is to accurately detect a swing in the electorate.  Suppose for instance that over the course of a given time period (e.g. a week), 7% of the voters switch their vote from not-A to A, while another 2% of the voters switch their vote from A to not-A, leading to a net increase of 5% in the proportion $p$ of voters voting for A. How would this swing in the vote affect the proportion $\overline{p}$ of the voters being polled, if one imagines the same voters being polled at both the start of the week and at the end of the week?

If the poll was conducted by simple random sampling, then each of the 1,000 voters polled would have a 7% probability of switching from not-A to A, and a 2% probability of switching from A to not-A.  Thus, one would expect about 70 of the 1,000 voters polled to switch to A, and about 20 to switch to not-A, leading to a net swing of 50 voters, that would increase $\overline{p}$ by 5%, thus matching the increase in $p$.  Now, in practice, there will be some variability here; due to the luck of the draw, the poll may pick up more or less than 70 of the voters switching to A, and more or less than 20 of the voters switching to not-A.  But having 1,000 voters to sample is just about large enough for the law of large numbers to kick in and ensure that the number of voters switching to A picked up by the poll will be significantly larger than the number of voters switching to not-A.  Thus, this poll will have a good chance of detecting a swing of size 5% or more, which is consistent with the assertion of a margin of error of about 3%. [In appealing to the law of large numbers, we are implicitly exploiting the uniformity and independence assumptions in Hypothesis 5.]
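To quantify the "luck of the draw" here, one can compute the mean and standard deviation of the net swing seen by the poll. The sketch below treats each polled voter as an independent draw who moves to A with probability 7%, away from A with probability 2%, and otherwise stays put. This is a simplifying assumption, and the name `swing_statistics` is illustrative.

```python
import math


def swing_statistics(n=1_000, to_a=0.07, from_a=0.02):
    """Mean and standard deviation of the net swing (in voters) seen by
    re-polling the same n voters, treating them as independent draws."""
    # Each voter contributes +1 with probability to_a, -1 with
    # probability from_a, and 0 otherwise.
    mean_per_voter = to_a - from_a
    second_moment = to_a + from_a  # expected value of the squared change
    var_per_voter = second_moment - mean_per_voter ** 2
    return n * mean_per_voter, math.sqrt(n * var_per_voter)


mean, sd = swing_statistics()
print(f"net swing: {mean:.0f} voters, standard deviation {sd:.1f} voters")
```

The net swing is about 50 voters with a standard deviation of roughly 9, so the observed change in $\overline{p}$ is 5% give or take about 1%, comfortably distinguishable from no swing at all.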

It is worth noting that this swing of 5% in an electorate of 200,000,000 voters represents quite a large shift in absolute terms: fourteen million voters switching to A and four million switching away from A.  Quite a few of these shifting voters will be picked up by the poll (in contrast to one’s sphere of friends and acquaintances, which is likely to be missed completely).

– Irregularity –

Another intuitive objection to polling accuracy is that the voting population is far from homogeneous.  For instance, it is clear that voting preferences for the U.S. presidential election vary widely among the 50 states – shouldn’t one need to multiply the poll size by 50 just to accommodate this fact?  Similarly for distinctions in voting patterns based on gender, race, party affiliation, etc.

Again, these irregularities in voter distribution do not affect the final accuracy of the poll, for two reasons.  Firstly, we are asking only the simple question of whether a voter votes for A or not-A, and are not breaking down the answers to this question by state, gender, race, or any other factor; as stated before, two voters are considered equivalent as long as they have the same preference for A, even if they are in different states, have different genders, etc.  Secondly, while it is conceivable that the poll will cluster its sample in one particular state (or one particular gender, etc.), thus potentially skewing the poll, the fact that the voters are selected uniformly and independently of each other prevents this from happening very often.  (And in any event, clustering in a demographic or geographic category is not what is of direct importance to the accuracy of the poll; the only thing that really matters in the end is whether there is clustering in the category of A-voters or not-A-voters.)  The independence hypothesis is rather important.  If for instance one were to poll by picking one particular location in the U.S. at random, and polling 1,000 people from that location, then the responses would be highly correlated (as one could have picked a location which happens to highly favour A, or highly favour not-A) and would have a much larger margin of error than if one polled 1,000 people at random across the U.S.
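The cost of dropping independence can be illustrated with a toy model. The sketch below assumes a purely hypothetical electorate of 50 equally sized regions whose support for A ranges evenly from 30% to 70%; these numbers are invented for illustration and are not real data. It compares the standard deviation of $\overline{p}$ for a nationwide simple random sample against that of a poll conducted in one randomly chosen region, using the law of total variance and treating each region as large enough for binomial sampling.

```python
import math

n = 1_000
# Hypothetical electorate: 50 equally sized regions whose support for A
# ranges evenly from 30% to 70% (illustrative numbers, not real data).
regions = [0.30 + 0.40 * k / 49 for k in range(50)]
p = sum(regions) / len(regions)  # nationwide support for A

# Simple random sample across the whole country.
srs_sd = math.sqrt(p * (1 - p) / n)

# Clustered poll: pick one region uniformly at random, poll n voters there.
# Law of total variance: spread between regions + sampling noise within one.
between = sum((q - p) ** 2 for q in regions) / len(regions)
within = sum(q * (1 - q) for q in regions) / len(regions) / n
clustered_sd = math.sqrt(between + within)

print(f"simple random sampling: sd = {srs_sd:.3f}")
print(f"single-location poll:   sd = {clustered_sd:.3f}")
```

In this toy model the single-location poll is several times less accurate than the nationwide one, and polling more people at that one location would barely help, since the between-region term does not shrink with $n$.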

[Incidentally, in the specific case of the U.S. presidential election, statewide polls are in fact more relevant to the outcome of the election than nationwide polls, due to the mechanics of the U.S. Electoral College, but this does not detract from the above points.]

– Analogies –

Some analogies may help explain why the relative size of a sample is largely irrelevant to the accuracy of a poll.

Suppose one is in front of a large body of water (e.g. a sea or ocean), and wants to determine whether it is a freshwater or saltwater body.  This can be done very easily: dip one’s finger into the body of water and taste a single drop.  This gives an extremely accurate result, even though the relative proportion of the sample size to the population size is, literally, a drop in the ocean; the quintillions of water molecules and salt molecules present in that drop are more than sufficient to give a good reading of the salinity of the water body.

[To be fair, in order for this reading to be accurate, one needs to assume that the salinity is uniformly distributed across the body of water; if for instance the body happened to be nearly fresh on one side and much saltier on the other, then dipping one's finger in just one of these two sides would lead to an inaccurate measurement of average salinity.  But if one were to stir the body of water vigorously, this irregularity of distribution disappears.  The procedure of taking a random sample, with each sample point being independent of all the others, is analogous to this stirring procedure.]

Another analogy comes from digital imaging.  As we all know, a digital camera takes a picture of a real-world object (e.g. a human face) and converts it into an array of pixels; an image with a larger number of pixels will generally lead to a more accurate image than one with fewer.  But even with just a handful of pixels, say 1,000 pixels, one is already able to make crude distinctions between different images, for instance to distinguish a light-skinned face from a dark-skinned face (despite the fact that skin colour is determined by millions of cells and quintillions of pigment molecules).  See for instance this well-known (and very low resolution) image of a US president, by Leon Harmon:

– Appendix: Mathematical justification –

One can compute the margin of error for this simple sampling problem very precisely using the binomial distribution; however I would like to present here a cruder but more robust estimate, based on the second moment method, that works in much greater generality than the setting discussed here.  (It is closely related to the arguments in my previous post on the law of large numbers.)  The main mathematical result we need is

Theorem. Let X be a finite set, let A be a subset of X, and let $p := |A|/|X|$ be the proportion of elements of X that lie in A.  Let $x_1, \ldots, x_n$ be sampled independently and uniformly at random from X (in particular, we allow repetitions).  Let $\overline{p} := |\{1 \leq i \leq n: x_i \in A \}|/n$ be the proportion of the $x_1,\ldots,x_n$ (counting repetition) that lie in A.  Then for any $r > 0$, one has

$\displaystyle {\Bbb P}( |\overline{p}-p| \leq r ) \geq 1 - \frac{1}{4 n r^2}$. (1)

Proof. We use the second moment method.  For each $1 \leq i \leq n$, let $I_i$ be the indicator of the event $x_i \in A$, thus $I_i := 1$ when $x_i \in A$ and $I_i = 0$ otherwise.  Observe that each $I_i$ has a probability of p of equaling 1, thus

$p = {\Bbb E} I_i.$

On the other hand, we have

$\overline{p} = \frac{1}{n} \sum_{i=1}^n I_i$.

Thus

$\overline{p}-p = \frac{1}{n} \sum_{i=1}^n (I_i - {\Bbb E} I_i)$;

squaring this and taking expectations, we obtain

${\Bbb E} |\overline{p}-p|^2 = \frac{1}{n^2} \sum_{i=1}^n {\bf Var}(I_i) + \frac{2}{n^2} \sum_{1 \leq i < j \leq n} {\bf Cov}(I_i,I_j)$

where ${\bf Var}(I_i) := {\Bbb E} (I_i-{\Bbb E} I_i)^2$ is the variance of $I_i$, and ${\bf Cov}(I_i,I_j) := {\Bbb E}( (I_i-p) (I_j-p))$ is the covariance of $I_i, I_j$.

By assumption, the random variables $I_i, I_j$ for $i \neq j$ are independent, and so the covariances ${\bf Cov}(I_i, I_j)$ vanish.  On the other hand, a direct computation shows that

${\bf Var}(I_i) = p - p^2 = \frac{1}{4} - (p-\frac{1}{2})^2 \leq \frac{1}{4}$

for each i.  Putting all this together we conclude that

${\Bbb E} |\overline{p}-p|^2 \leq \frac{1}{4n}$

and the claim (1) follows from Markov’s inequality, applied to the non-negative random variable $|\overline{p}-p|^2$ with threshold $r^2$. $\Box$

Applying this theorem with n=1000 and $r=1/\sqrt{200} \approx 0.07$, we conclude that p and $\overline{p}$ lie within about 7% of each other with probability at least 95%, regardless of how large the population X is.  In the context of an election poll, this means that if one asks 1000 voters, sampled independently at random (with replacement), whether they would vote for A, the margin of error for the answer would be at most 7% at the 95% confidence level.
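The choice of $r$ here comes from solving $1 - \frac{1}{4nr^2} = 0.95$ for $r$. A minimal sketch of that calculation follows; the function names `chebyshev_margin` and `bound_1` are hypothetical.

```python
import math


def chebyshev_margin(n, confidence=0.95):
    """Smallest r for which bound (1) guarantees the given confidence:
    solve 1 - 1/(4 n r^2) = confidence for r."""
    return 1 / math.sqrt(4 * n * (1 - confidence))


def bound_1(n, r):
    """Right-hand side of (1): lower bound on P(|p_bar - p| <= r)."""
    return 1 - 1 / (4 * n * r * r)


r = chebyshev_margin(1_000)
print(f"r = {r:.4f}, guaranteed probability = {bound_1(1_000, r):.3f}")
```

Note that the population size $|X|$ is not an input to either function, which is exactly the point of the theorem.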

Remark 1. Observe that the proof of the above theorem did not really need the $x_i$ to be fully independent of each other; the key thing was that each $x_i$ was close to uniformly distributed, and that the covariances between the indicators $I_i, I_j$ were small.  (Thus one only needs pairwise independence rather than joint independence for the theorem to hold.)  Because of this, one can also obtain variants of the above theorem when one selects $x_1,\ldots,x_n$ for random sampling without replacement (known as simple random sampling); now there is a slight correlation between $I_i, I_j$, but it turns out to be negligible when X is large, for instance when n=1000 and $|X| \sim 10^8$.  (For this range of parameters, there is a small but non-zero probability, well under 1%, of a birthday paradox occurring, i.e. of some voter being selected twice when sampling with replacement, so the two sampling methods are genuinely different from each other; but they turn out to have almost the same margin of error anyway.) $\diamond$
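Both effects mentioned in this remark can be estimated numerically. The sketch below relies on the standard finite population correction, under which the variance of $\overline{p}$ for sampling without replacement is the with-replacement variance multiplied by $(|X|-n)/(|X|-1)$. This formula is a standard fact brought in for the sketch rather than something derived in this post, and the function names are illustrative.

```python
import math


def fpc_factor(n, population):
    """Ratio of standard deviations of p_bar: sampling without replacement
    versus with replacement (the finite population correction)."""
    return math.sqrt((population - n) / (population - 1))


def collision_probability(n, population):
    """Chance that sampling n times WITH replacement hits some voter twice."""
    log_no_collision = sum(math.log1p(-i / population) for i in range(n))
    return -math.expm1(log_no_collision)


n, population = 1_000, 10**8
print(f"standard deviation ratio: {fpc_factor(n, population):.7f}")
print(f"collision probability:    {collision_probability(n, population):.4f}")
```

For these parameters the two standard deviations agree to within a few parts per million, while a repeated voter shows up in roughly one out of every two hundred with-replacement samples.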

Remark 2. If one assumes joint independence instead of pairwise independence, one can obtain slightly sharper inequalities than (1) (e.g. by using the Chernoff inequality), but at the 95% confidence level, this gives only a relatively modest improvement in the margin of error (in our specific example, the optimal margin of error is about 3% rather than 7%).  $\diamond$
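The "about 3%" figure can be recovered by summing the binomial distribution directly instead of going through the Chernoff inequality. The sketch below assumes sampling with replacement and takes p = 1/2, which is treated here as the worst case; the name `exact_margin` is hypothetical.

```python
import math


def exact_margin(n, confidence=0.95):
    """Smallest r = m/n with P(|p_bar - 1/2| <= r) >= confidence when the
    number of A-voters in the sample is Binomial(n, 1/2); n must be even."""
    half = n // 2
    total = 2 ** n                # number of equally likely outcomes
    covered = math.comb(n, half)  # outcomes with |K - n/2| <= 0
    m = 0
    # Widen the window symmetrically until it holds enough probability.
    while covered / total < confidence:
        m += 1
        covered += math.comb(n, half - m) + math.comb(n, half + m)
    return m / n


print(f"exact 95% margin for n = 1000, p = 1/2: {exact_margin(1_000):.3f}")
```

All the counting is done in exact integer arithmetic, so the only rounding happens in the final comparison against the confidence level.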

Remark 3. An inspection of the argument shows that if p is known to be very small or very large, then the margin of error is better than what (1) predicts.  (In the most extreme case, if p=0 or p=1, then it is easy to see that the margin of error is zero.)  But in the case of election polls, p is generally expected to be close to 1/2, and so one does not expect to be able to improve the margin of error much from this effect. And in any case, we don’t know the value of p exactly in practice (otherwise why would we be doing the poll in the first place?).  $\diamond$

Remark 4. In real world situations, it can be difficult or impractical to get the $x_i$ to be close to uniformly distributed (because of sampling bias), and to keep the correlations low (because of effects such as clustering).  Because of this, one often needs to perform a more complicated sampling procedure than simple random sampling, which requires more sophisticated statistical analysis than given by the above theorem.  This is beyond the scope of this post, though. $\diamond$

[Updated, October 13: added emphasis that the confidence level only applies before one performs the poll, not afterwards.]

[Updated, October 17: Minor corrections; thanks to Tom Verhoeff for pointing them out.]