Monday, October 27, 2008

On FiveThirtyEight

I've been often asked the difference between StochasticDemocracy's model and Nate's FiveThirtyEight. Since the response "Ours is better" never gets us very far, I've decided to write out the difference in more depth.

The best way to do that, is to simply describe both models. I've done my best to make this as easy to read as possible, but I'm a math major, so I can't make any promises on my presentation abilities. Because of this, I'll do my best to answer any questions in the comments thread.

Out of deference to his hundreds of thousands of visitors, I'll start with his:

FiveThirtyEight




1) First, he calculates measures of "pollster reliability"(How wrong pollsters tend to be in proportion to a theoretical ideal pollster) and "house effects"(The extent to which some firms are biased toward one canidate or another) for every pollster. He calculates these measures by looking at data from the 2008 primaries, 2006 midterms, and 2004 presidential elections.

2) He then assigns each poll a weight according to to sample size, the age of the poll(he assumes that poll accuracy decays exponentially with time), and reliability behind the pollster. He also changes the margin of the poll according to the historical house affects of the pollster behind the poll

3) He now assigns undecided voters in each poll according to a formula he derived from a regression on the 2008 democratic primaries.

4) He now runs a "trend adjustment" to "update" outdated polls using a regression. The basic idea is that if favorable polls for Obama are released in North Dakota, then he will "update" the previous polls in South Dakota to be more favorable toward Obama. In practice, he does this in a hideously complicated way, utilizing a LOESS curve fitted on a massive number of dummy variables that are calculated by regression. The regression in this step includes national polls. He explains it better than I could here, here, and here.

5) With the newly adjusted polls are now used to estimate(via regression on demographic data) a formula, which is then used to create "538 estimates" that are treated by the model as a poll. He now calculates a weighted average for of every "poll" for every state.

6) He then estimates how much public opinion can change between now and the election, by looking at data from the 2004 and 2000 elections(See his original post on the subject here). He has probably changed his methodology since then, but Nate hasn't been very open on this front.

It's important to note that he assumes that states will move together, using a state similarity metric in order to estimate covariance. He also assumes that polls will "tighten" as election day approaches.

7) Finally, he simulates 10,000 "elections" using the above model, giving us the summary statistics.

[That is the short version, I've skipped some things...]

Now, I have little quibbles with the individual steps involved(And Stats geeks can see a long rant about them here). But my bigger concern is that his model is too complicated. His model has so many moving parts and implicit assumptions, that it's hard enough to even check it's consistency, let alone accuracy.

StochasticDemocracy

Our model is considerably simpler. At heart, we assume that public opinion follows a random walk. But unfortunately, we don't get to observe public opinion directly, we only observe polls.

But if we want to study public opinion using polls, we can use a Hidden Markov Framework. (See also, Recursive Bayesian Filtering)


A Sample Hidden Markov Process, where x(t) represents public opinion, and y(t) represent observed polls

Statistics enthusiasts may notice that this resembles Kalman Filtering, which has been used extensively for election forecasting. When day to day changes are Normaly distributed , Recursive Bayesian Estimation is equivalent to Kalman Filtering.

But we have reasons, both analytic and empirical, to believe that the distribution is not normal, but Leptokurtic  (Fat-Tailed). In layman terms, this means that "big" events happen more frequently then one would expect using a Normal Distribution. Without accounting for this, "big" events(like the nomination of Sarah Palin or the collapse of the Stock Market) distort our estimates.

So we have implemented a Recursive Bayesian Filter that assumes Leptokurtic disturbances. While I have never seen such a filter applied to political polls, it's been used before in modeling of both financial and economic time series[See Bidarkota and Dupoyet (2005)]. (The filter has been slightly modified to account for little things like the multiple day polling period of polls)

The advantages of this are that under relatively defensible assumptions, this method provides the most efficient and accurate poll aggregation possible given the data available. Better than Pollster's LOESS method, and better than Fivethirtyeight's snapshots.

We use this filter to estimate both the current state of the race, the variance of day to day movement in the state, and the errors involved with our estimates. From there, it is relatively straight-forward algebra to calculate all of the numbers displayed on the site.

My favorite thing about this framework is that it can be expanded heavily while maintaining it's theoretical simplicity. Things like House Effects, Pollster Reliability, and state to state correlations can be easily handled in both simple and transparent ways, with extensions to other types of elections being similarly straightforward. This way, when I make mistakes, it's easy for others to mock them.

Any Questions?

3 comments:

dylan said...

The assumption that the polls follow a random walk is blatantly not true: it is obviously not possible for either candidate's support to go above 100%, for instance. That's an extreme case, but I would certainly expect similar behaviour to happen at more reasonable poll numbers. Concretely, a candidate at 90% support is much more likely to lose support than to gain some. It also seems that your error bars from early in the popular vote polls are much too wide for this reason.

Of course this effect needs to be quantified, but I would be surprised if it is not detectable and relevant for the model. This is why fivethirtyeight expects the polls to tighten, for instance.

David said...

Obviously polls can't perfectly follow a random walk. If they did, than "safe" states like California and New York would be very unlikely.

But I've done the econometric tests for mean reversion, and I've always failed to reject the null hypothesis that the autoregressive coefficient is less than 1 with 5% confidence(Though it's usually pretty close, at p=.03 or .04)

Theoretically, the effect you are talking about has to exist, but empirically it doesn't seem to be very relevant with the margins we deal with.

But FiveThirtyEight does something a little different. It makes the assumption that polls will turn against the person who is ahead in the state right before the election.

While that effect might be there, it isn't on as firm theoretical footing as the one you suggest.

David said...

One other thing, I don't think the error bars are as off as you think.

To some extent, it's always hard to get information about tail events, but...

A 95% confidence interval is incredibly wide. Do you really think that in 20 random elections, taken over the course of 80 years, not a single president would get 65% of the vote? I don't know, but it seems fairly likely.