Nate's methodology is fairly complex, but at heart, he judges pollsters by looking at how close their last poll was to an election result. This seems intuitively fair, but mathematically it poses problems for what he is trying to do.
Arithmetically, a poll before election day can be expressed as Poll = Opinion + Bias + Sampling Error, so absolute error from an election result can be expressed as |Opinion + Bias + Sampling Error - Result| = |Last Minute Movement + Bias + Sampling Error|, where Last Minute Movement = Opinion - Result [1].
The problem arises from trying to study Sampling Error using a statistic that is dominated by last-minute movement and Pollster Bias. Because last-minute movement and Pollster Bias vary considerably between elections, adjusting for them would be very difficult. Even worse, this method throws away the data from the vast majority of polls, which come earlier in the cycle. There exists a better way!
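To make the decomposition concrete, here is a minimal simulation sketch. The standard deviations for last-minute movement and bias (2 points each) and the sample size (1000) are made-up values chosen only for illustration, not estimates from any real data.

```python
import random
import statistics


def simulate_abs_errors(n_elections=20000, movement_sd=2.0, bias_sd=2.0,
                        sample_size=1000, seed=0):
    """Compare |movement + bias + sampling| with |sampling| alone.

    All standard deviations are hypothetical, in percentage points.
    """
    rng = random.Random(seed)
    # Ideal sampling standard error for a 50% share, in points.
    sampling_sd = (50 * 50 / sample_size) ** 0.5
    full_errors, sampling_errors = [], []
    for _ in range(n_elections):
        movement = rng.gauss(0, movement_sd)   # Opinion - Result
        bias = rng.gauss(0, bias_sd)           # house effect in this race
        sampling = rng.gauss(0, sampling_sd)   # pure sampling error
        full_errors.append(abs(movement + bias + sampling))
        sampling_errors.append(abs(sampling))
    return statistics.fmean(full_errors), statistics.fmean(sampling_errors)


mean_full, mean_sampling = simulate_abs_errors()
print(f"mean |poll - result|: {mean_full:.2f} points")
print(f"mean |sampling error| alone: {mean_sampling:.2f} points")
```

Under these assumed values, most of the absolute error comes from the two terms that have nothing to do with sampling, which is why the last-poll error is a poor instrument for measuring sampling quality.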
In order to understand how to proceed, it's important to stress the difference between House Effects (Bias) and Design Effects (PIE), and their implications for forecasting. It comes down to the difference between accuracy and precision.
Consider a hypothetical pollster with an absurdly large Design Effect but a negligible House Effect. Let's call it Zogby Interactive [2].

Zogby performs just as well on average as an ideal pollster, but its sampling error is much higher than expected. By comparing the variance of the poll series with that of public opinion, we can estimate excess sampling error. Note that Zogby Interactive polls provide very little information about a race.
Now consider another hypothetical pollster with a pretty small design effect but a large Pro-Republican Bias named Rasmussen.

Rasmussen polls follow Public Opinion almost perfectly, but with a pro-Republican Bias. By measuring the difference from public opinion, we can estimate House Effects. Note that once Bias is estimated and accounted for, Rasmussen polls provide quite a bit of information about a race.
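The two hypothetical pollsters can be sketched in a few lines of Python. This is only an illustration: it pretends the true opinion path is known, which it never is in practice, and the bias and PIE values (0 and 3.0 for the noisy pollster, 4 points and 1.1 for the biased one) are invented for the example.

```python
import random
import statistics


def ideal_se(share, sample_size):
    """Standard error (in points) of a share under simple random sampling."""
    return (share * (100 - share) / sample_size) ** 0.5


def simulate_polls(opinion, bias, pie, sample_size, rng):
    """One poll per day: Normal(opinion + bias, (ideal SE * PIE)^2)."""
    return [rng.gauss(x + bias, ideal_se(x, sample_size) * pie) for x in opinion]


def estimate_effects(polls, opinion, sample_size):
    """Return (house effect, PIE) given polls and a KNOWN opinion path."""
    residuals = [y - x for y, x in zip(polls, opinion)]
    house_effect = statistics.fmean(residuals)
    # Spread of the de-biased residuals, in units of the ideal standard error.
    standardized = [(r - house_effect) / ideal_se(x, sample_size)
                    for r, x in zip(residuals, opinion)]
    return house_effect, statistics.stdev(standardized)


rng = random.Random(1)
opinion = [48.0]
for _ in range(399):
    opinion.append(opinion[-1] + rng.gauss(0, 0.2))  # simple random walk

noisy = simulate_polls(opinion, bias=0.0, pie=3.0, sample_size=1000, rng=rng)
biased = simulate_polls(opinion, bias=4.0, pie=1.1, sample_size=1000, rng=rng)
print("noisy pollster (house effect, PIE):", estimate_effects(noisy, opinion, 1000))
print("biased pollster (house effect, PIE):", estimate_effects(biased, opinion, 1000))
```

The mean of the residuals recovers the house effect and their spread recovers the design effect, so the two quantities are separable once opinion is pinned down. The hard part, which this sketch skips, is that opinion has to be estimated at the same time.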
Nate Silver has noted both effects, and has tried to estimate House Effects separately in the past. But Opinion, House Effects, and Design Effects all require each other to estimate, and so any accurate model should consider them jointly.
In order to do so, we've set up a Bayesian Dynamic Linear Model and estimated it with MCMC [5]. The model is pretty simple, and is a slight generalization of the site's previous methodology, which has performed pretty well [3]:
We assume that public opinion follows a simple random walk[4]. But public opinion isn't directly available. All we have are noisy and biased polls.

A Dynamic Linear Model, where x(t) represents public opinion and y(t) represents observed polls. Arrows indicate conditional probabilities. See footnote [5] for details.
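For readers who want to see the structure in code, here is a minimal sketch of the state-space idea. It is not the MCMC estimation described in this post: it uses a plain Kalman filter, treats Bias, PIE, and the daily random-walk standard deviation as known inputs rather than estimating them, and evaluates the poll variance at the current opinion estimate. The function name and all numbers in the example are hypothetical.

```python
def filter_opinion(polls_by_day, bias, pie, daily_sd,
                   prior_mean=50.0, prior_var=100.0):
    """Kalman filter for a random-walk opinion observed through biased polls.

    polls_by_day: list (one entry per day) of lists of (poll_value, sample_size).
    bias, pie, daily_sd are treated as known here; the post estimates them.
    Returns a list of (mean, variance) for opinion on each day.
    """
    mean, var = prior_mean, prior_var
    path = []
    for polls in polls_by_day:
        var += daily_sd ** 2  # predict step: x(t) = x(t-1) + e(t)
        for value, sample_size in polls:
            # Poll variance from footnote [5], evaluated at the current estimate.
            poll_var = mean * (100 - mean) / sample_size * pie ** 2
            gain = var / (var + poll_var)
            mean += gain * (value - bias - mean)  # remove bias, then update
            var *= 1 - gain
        path.append((mean, var))
    return path


# Hypothetical polls from one pollster with an assumed +3 point house effect.
polls_by_day = [[(51.0, 1000)], [], [(54.0, 800)], [(52.5, 1200)], [], [(53.0, 1000)]]
for day, (mean, var) in enumerate(filter_opinion(polls_by_day, 3.0, 1.25, 0.2), 1):
    print(f"day {day}: opinion {mean:.2f} +/- {var ** 0.5:.2f}")
```

On days without polls the estimate stays put while its variance grows, which is the random-walk assumption at work. A full Bayesian treatment would also carry uncertainty about Bias and PIE through to the opinion estimate.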
Actual Estimates:
One of our system's best features is that it's a fully Bayesian probability model, allowing us to derive joint posterior distributions of all of our parameters.
Because of this, we can judge the primary debate between Mark at Pollster.com and Nate at FiveThirtyEight: do Design Effects (PIE) vary significantly among pollsters?
Short Answer: No
In order to check this, we ran our model on this cycle's Generic Ballot Polling, setting very weakly informative priors for each Pollster's PIE, without assuming any relationship between them.

Estimated Pollster PIE vs Number of polls in the Generic Ballot database.
The issue here is that PIE is a ratio strictly greater than one, and so greater uncertainty will drive the mean estimate in only one direction. On Andrew Gelman's suggestion, I imposed a Hierarchical model on the priors in order to account for this, leading to the following estimates:
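The one-directional pull can be illustrated with a loose frequentist analogy. This is not the Bayesian model or the hierarchical fix used here: it simply floors a sample standard deviation at one and averages it over many simulated pollsters. The true PIE of 1.1 and the poll counts are assumptions for the example.

```python
import random
import statistics


def sample_sd(values):
    """Unbiased-variance sample standard deviation."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5


def mean_floored_pie(n_polls, true_pie=1.1, n_reps=4000, seed=0):
    """Average of max(1, sample SD of standardized poll errors) over many
    simulated pollsters that each release n_polls polls."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        errors = [rng.gauss(0, true_pie) for _ in range(n_polls)]
        estimates.append(max(1.0, sample_sd(errors)))  # estimate cannot go below 1
    return statistics.fmean(estimates)


for n_polls in (3, 10, 50):
    print(n_polls, "polls -> mean floored estimate", round(mean_floored_pie(n_polls), 3))
```

Every simulated pollster has the same true PIE, yet the ones with only a handful of polls come out looking worse on average, purely because noise can push the estimate up but not down past one. Pooling information across pollsters is one way to damp that artifact.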

Corrected PIE vs number of Polls in Generic Ballot
There's little reason to believe from this graph that Pollster Introduced Error in the Generic Ballot varies at all. The difference between even the best and worst pollster is not statistically distinguishable from zero.
It seems that pollsters perform about 25% worse than would be predicted by ideal random sampling, which would add less than a point of error. This is in agreement with previous research, but directly contradicts Nate's findings that polls have an added error of 3-4 points.
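The "less than a point" figure is easy to check by hand. The sketch below assumes a typical poll of 1000 respondents with a 50% share (both assumed values) and reads "25% worse" as a PIE of 1.25 multiplying the ideal standard error, as in footnote [5]. The 3-4 point figure may be defined differently, so the second function is only a rough comparison.

```python
def added_error(pie, sample_size=1000, share=50.0):
    """Extra standard error (in points) beyond ideal random sampling."""
    ideal_se = (share * (100 - share) / sample_size) ** 0.5
    return ideal_se * (pie - 1)


def pie_for_added_error(points, sample_size=1000, share=50.0):
    """PIE that would be needed to add `points` of standard error."""
    ideal_se = (share * (100 - share) / sample_size) ** 0.5
    return 1 + points / ideal_se


print(f"added error at PIE 1.25: {added_error(1.25):.2f} points")
print(f"PIE needed for 3 added points: {pie_for_added_error(3.0):.2f}")
```

With these inputs the ideal standard error is about 1.58 points, so a PIE of 1.25 adds roughly 0.4 points, while adding 3 points would need a PIE of about 2.9.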
The discrepancy occurs because Nate's estimating method conflates House Effects with PIE. This inflates the estimates to be bigger than they actually are.

Preliminary House Effects by Pollster with Standard Errors. Pollsters in Bold have effects statistically different from zero. [6]
The Exception that Proves the Rule:
While the graphs were from the Generic Ballot Polling of this cycle, the results hold for other national and state level races from 2008, 2009, and 2010. It seems that for election horse-race polling, PIE does not vary significantly among Pollsters [7]. But polling for Obama Approval seems to have much higher Design Effects and a little more differentiation between Pollsters.

Estimated Pollster Introduced Error vs Number of Polls in Obama Approval Database
It's not clear why Polling for Obama Approval is more unreliable than polling for elections. It could be that because there is no accountability in the form of an election result, discipline and standards are laxer than they would otherwise be.
What Comes Next:

Preliminary Graph of Obama Job Approval with 95% confidence intervals, adjusted for House Effects and PIE.
Over the next couple of days, we plan to release an election forecasting model for House, Senate and Governor races, as well as introduce a house-effect and PIE adjusted tracker for the Generic Ballot and Obama Job Approval. This is to be done in collaboration with Professor Wang at the Princeton Election Consortium and a soon-to-be-named major polling website. Expect more in the coming days.
Footnotes
[1] - 538 looks at all polls from 21 days before the election to try to correct for last minute mean reversion. Last minute movement should increase as we move further from election day, potentially creating another confounding factor.
[2] - This gives Zogby Interactive too much credit: they have a very large house effect as well as a design effect.
[3] - This is based on the previous work of Simon Jackman of Stanford and Mark Pickup at Oxford and the UK PoliticsHome Poll team, both of whom generously made their code available.
[4] - This is a technical footnote, but there is a good deal of evidence that public opinion of major parties in Western Democracies does indeed follow a simple random walk, as opposed to one subject to "momentum" and "mean-reversion".
[5] That is to say,
Opinion(t) = Opinion(t-1) + e(t), where e(t) is a random variable with mean 0.
Meanwhile, a poll with SampleSize respondents on day t is a normally distributed random variable with mean Opinion(t)+Bias and variance [Opinion(t)*(100-Opinion(t))/SampleSize]*PIE^2, where PIE denotes Pollster Introduced Error or a Design Effect.
Under this specification, PIE can be interpreted as the ratio of the actual sampling standard deviation to the standard deviation predicted by perfect random sampling, so PIE^2 is the corresponding ratio of variances.
[6] House-Effects seem to vary a bit from race to race. We're planning to produce more precise House Effect Estimates by pooling parameters from different races using a further multi-level model.
[7] Except Zogby Interactive. Zogby Interactive did not show up in the graphs because they have not fielded any polls in the Generic Ballot. But their PIE in other races has been flagged as an outlier, having a much higher Design Effect than any other Pollster.
5 comments:
Hmm -- It looks like you will be able to do a pollster by pollster comparison on variance from the mean for both election polling and presidential approval to see if there are systematic house differences on the two measures. THAT will be interesting. Thanks for taking on the task.
Since house effects can differ between years and races, I wonder whether we shouldn't treat them as some sort of pollster-induced error. I presume they're refining their methods to get rid of biases that show up, so we shouldn't expect pollsters to have consistent house effects over multiple races or years - for example, Silver's data shows SurveyUSA's house effect to bounce around a bit early in their history. If a pollster has a history of varying biased house effects, perhaps the model should indeed take that into account. Perhaps Silver's way of folding in the house effect straight into the PIE weights isn't the right way to go about things, mathematically - I would think you would be able to start to estimate what pollsters' house effects are on the fly with reasonable prior distributions and enough data. (And perhaps some pollsters in fact do have a consistent lean in one direction in their house effects over multiple cycles.)
Is there any chance of trying to model the mean and variance of pollsters' house effects? Or is there not enough data?
AySz88,
I think that's a valid point. I'm going to try to run through historical data this week to get a handle on how House Effects have changed from cycle to cycle. It's harder than it sounds, just because the historical datasets are pretty crappy, and this sort of thing isn't robust to coding errors. (Think how our estimates for a race would change if Ras released a poll with the Dem and Rep numbers flipped.)
On the second point, I'm pretty concerned about within-cycle House-Effect instability. I suspect I'll be able to produce useful estimates for the bigger pollsters (Ras, YouGov, etc.), but I'm still working out the right way to do it.