Herding, polls of polls and Bayes’ theorem

You have probably heard of that outlier Iowa poll that gave a surprisingly large advantage to Harris in the 2024 election. Related to this is what forecasters call "herding". Nate Silver, of forecasting fame, wrote:

It’s obviously a really close race. But for some pollsters, it’s a little too close.

Take, for example, this afternoon’s polling release from the British firm Redfield & Wilton. They polled all seven of the core battleground states. And in all seven, Kamala Harris and Donald Trump each received between 47 and 48 percent of the vote: [image omitted]

Isn’t this a little convenient? Whatever happens, Redfield & Wilton — not a firm with a well-established reputation in the US — will be able to throw up their hands and say “well, we projected a tie, so don’t blame us!”. And since all of these states are also close in the polling averages, they’ll also ensure that they won’t rank at the bottom of the table of the most and least accurate pollsters — although unless the race really is that close, and it probably won’t be, they also won’t rank toward the top.

Nate Silver is saying that there's too little sampling error relative to what you would expect from the typical sample size (n = 800). Too little between-study variation is not what you normally find in meta-analyses in science; usually there's too much. When there's more variation between studies' estimates than can be attributed to sampling error alone, it is wise to look for so-called moderators, that is, study-level properties that explain variation between studies (e.g., type of measure used, year of publication, range restriction). Endless debates about results depend on these. Excess between-study variation is also why random effects meta-analysis models are almost universally recommended by statisticians: they allow for a distribution of effect sizes rather than just a single value. All of this is based on there being too much between-study variation (heterogeneity, usually quantified by I²). What about too little? This can happen if researchers are cheating by not publishing studies that didn't find p < .05 (publication bias) or by reanalyzing them until they do (p-hacking). Specifically, the test of insufficient variance (TIVA) was developed by Ulrich Schimmack in 2014 and applied to the studies by Bem, who published a set of 10 studies trying to show that people can know the future. Nate Silver is essentially applying the same method here.
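To make the logic concrete, here is a minimal sketch of a TIVA-style insufficient-variance check in Python. The seven margins are hypothetical stand-ins for a release like the one Silver describes (every battleground state roughly tied, samples of about 800), not the firm's actual numbers. Under honest reporting, the margins standardized by their sampling error should have variance near 1; a left-tailed chi-square test flags variance that is implausibly small.

```python
# Sketch of a TIVA-style "insufficient variance" test applied to poll margins.
# The margins below are hypothetical, illustrating seven near-tied battleground polls.
import numpy as np
from scipy import stats

margins = np.array([0.01, 0.00, -0.01, 0.01, 0.00, 0.00, -0.01])  # Harris minus Trump (hypothetical)
n = 800                      # assumed sample size per poll
p_h, p_t = 0.475, 0.475      # rough two-party shares near a tie

# Sampling SD of the margin for a single multinomial poll of size n
se_margin = np.sqrt((p_h * (1 - p_h) + p_t * (1 - p_t) + 2 * p_h * p_t) / n)

# Standardize: under honest reporting these should have variance ~1.
# (Different states should, if anything, add true variation, which would make
# too-little variance even more suspicious.)
z = (margins - margins.mean()) / se_margin
var_z = z.var(ddof=1)

# Left-tailed chi-square test: how likely is variance this small by chance?
k = len(margins)
chi2_stat = (k - 1) * var_z
p_value = stats.chi2.cdf(chi2_stat, df=k - 1)

print(f"Sampling SE of one poll's margin: {se_margin:.3f}")
print(f"Variance of standardized margins: {var_z:.3f} (expected ~1)")
print(f"P(variance this small by chance): {p_value:.4f}")
```

With numbers like these, the observed spread is a small fraction of what sampling error alone would produce, and the test flags it accordingly.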

However, there is another angle to this that I haven't seen mentioned so far: Bayesianism with an informative prior. Herding is usually explained in psychological terms. Pollsters are rated by their ability to forecast the eventual result of the election, that is, by how much their best guess deviated from the truth (the error). This rating is their lifeblood and is also used as weights in the various meta-analyses, or "polls of polls" as forecasters call them. How do you get a better rating? Well, you could sample more people, but that's expensive. You could employ better sampling strategies, but that's also expensive or maybe impossible (some people just refuse to talk to you). You could employ better analytic methods, basically various ways to account for who is likely to vote and to weight the sample to match the population of the state, given that participation in surveys is not random (volunteer bias) and people may be coy or shy about their intentions. Another method is to simply not publish results that deviate a lot from the average and thus are likely to be wrong. This would waste the money you spent, but improve your pollster rating. Finally, you could adjust your results to be closer to the average of recent polls, making them more likely to be correct. If everybody does this, it will appear as if the overall estimate is extremely precise when it really isn't; the visible variation has been sneakily reduced while the true sampling error remains. This messes up the meta-analyses, and makes Nate Silver unhappy.

However, there is another way to look at this. If you are a pollster and you want to present readers with the current best estimate for some state based on your latest data plus the previous data, you would actually be wise to adjust your new results towards the mean. This is the bread and butter of Bayesianism. The results from the earlier studies form the prior, your new poll adds some evidence, and you end up with a mix, which is your result. From this perspective, the pollsters are being good Bayesians but are being punished for it (Silver says their method downranks pollsters that show insufficient variance). I guess this kind of forbidden base rate is not what Phil Tetlock had in mind, but it kinda fits. Here's an example based on a situation of 5 prior polls of n = 1000 which average about 50% (the prior), and one new outlier poll near 57% (the evidence), as well as the new best estimate (the posterior).
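A minimal sketch of that update, assuming a simple normal approximation and treating the five earlier polls as one pooled prior (the figures are the illustrative ones from the scenario above):

```python
# Sketch of the Bayesian shrinkage described above, via precision-weighted averaging.
import numpy as np

# Prior: 5 earlier polls of n = 1000 each, averaging ~50% for the candidate
prior_mean = 0.50
prior_n = 5 * 1000
prior_se = np.sqrt(prior_mean * (1 - prior_mean) / prior_n)

# Evidence: one new outlier poll of n = 1000 showing ~57%
new_mean = 0.57
new_n = 1000
new_se = np.sqrt(new_mean * (1 - new_mean) / new_n)

# Posterior: precision-weighted average of prior and new poll
w_prior, w_new = 1 / prior_se**2, 1 / new_se**2
post_mean = (w_prior * prior_mean + w_new * new_mean) / (w_prior + w_new)
post_se = np.sqrt(1 / (w_prior + w_new))

print(f"Prior:     {prior_mean:.1%} ± {1.96 * prior_se:.1%}")
print(f"New poll:  {new_mean:.1%} ± {1.96 * new_se:.1%}")
print(f"Posterior: {post_mean:.1%} ± {1.96 * post_se:.1%}")
```

The outlier gets pulled most of the way back: the posterior lands around 51%, much closer to the prior than to the new poll, because the pooled prior carries roughly five times as much data.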

The solution is simple, at least in theory: pollsters can publish two sets of results per poll, their best estimate for popular consumption (Bayesian) and the raw estimate for meta-analysis (frequentist, or Bayes with an uninformative prior, same thing). The problem is that insofar as herding is done to optimize their pollster rankings, they can't release both, since Nate Silver and others would downrank them if their raw estimates are bad. So instead they will have to keep doing Bayes behind the scenes, to everybody's detriment.