The recent upset election of Donald J. Trump came as a surprise to many of those following the polls. Full disclosure of personal bias: I was quite upset at the results of this election too. But this is a blog about statistics, not politics, so I will not dive into personal or political views.
I will discuss a side tangent relating to recent (as of 11/23/2016) accusations of election fraud. In a nutshell, it was found that there was a statistically significant difference in swing states between counties that used electronic polling and counties that did not, favoring Trump in the counties with electronic polling. In response, Nate Silver pointed out that while he could recreate the results in the accusation, they disappeared once you adjusted for other covariates (the covariates he listed: white, coll_degree and _cons; note that _cons in Stata output is simply the constant/intercept term, not another covariate).
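Silver's adjustment is a textbook confounding story. Here is a small, purely hypothetical simulation (all numbers invented, not fit to any real election data) sketching how a confounder like socio-economic status can manufacture a "polling technology effect" that vanishes once you control for it:

```python
import random
import statistics

random.seed(0)

# Purely hypothetical simulation (all numbers invented): a socio-economic
# index drives BOTH a county's polling technology and its vote share,
# so the naive "electronic vs. not" comparison shows a spurious effect.
counties = []
for _ in range(5000):
    ses = random.random()                            # socio-economic index in [0, 1]
    electronic = random.random() < ses               # richer counties use e-polling more
    vote = 0.5 - 0.2 * ses + random.gauss(0, 0.02)   # SES also shifts the vote share
    counties.append((ses, electronic, vote))

def mean_vote(rows):
    return statistics.mean(v for _, _, v in rows)

# Naive comparison: polling technology appears to matter.
naive_gap = (mean_vote([c for c in counties if c[1]])
             - mean_vote([c for c in counties if not c[1]]))

# Adjusted comparison: within narrow SES strata the gap vanishes,
# because SES was the real driver all along.
strata_gaps = []
for i in range(10):
    lo = i / 10
    stratum = [c for c in counties if lo <= c[0] < lo + 0.1]
    with_e = [c for c in stratum if c[1]]
    without_e = [c for c in stratum if not c[1]]
    if with_e and without_e:
        strata_gaps.append(mean_vote(with_e) - mean_vote(without_e))
adjusted_gap = statistics.mean(strata_gaps)

print(f"naive gap:    {naive_gap:+.3f}")     # sizeable spurious effect
print(f"adjusted gap: {adjusted_gap:+.3f}")  # near zero after adjustment
```

The simulation uses crude stratification rather than Silver's regression adjustment, but the logic is the same: once you compare like with like on the confounder, the "effect" of polling technology evaporates.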
A lot to talk about there, but I'll skip to the next step: lots of Twitterers saying things like "Linear regression is not appropriate, you need to use a more sophisticated model!".
This is one of the trickiest parts of practicing statistics. You have some data. You run your model. If the results are not what you wanted, do you re-examine your model and try again with a more appropriate one?
My answer: only with a lot of thought and care. A researcher who does this must realize that they are heading down the "Garden of Forking Paths". Does this mean it should never be done? No. In fact, we should realize that this is exactly what Nate Silver did (rerunning the analysis with what he considered potential confounding variables). And you may have picked up that I support what Nate Silver did, yet do not endorse rerunning the analysis with every model under the sun. Why?
To see why I draw this line, it helps to think like a Bayesian, even if you use frequentist tools. In particular, conditional on selecting the correct prior, Bayesian tests actually answer the question of interest ("what is the probability that H_a is true?"). Furthermore, again conditional on selecting the correct prior, Bayesian analyses need not worry about the issue of multiple comparisons. Keeping this in mind, "informal Bayesian statistics" can be done by looking at frequentist results and augmenting them with your prior (i.e., if p = 0.03 for something you consider nearly impossible, it's probably still false, while if p = 0.23 for a hypothesis that is very likely true, it's probably still true). Personally, I believe the founding fathers of frequentist statistics intended their methods to be used in this way. The main reason not to do a formal Bayesian analysis is that expressing all your prior knowledge as a simple probability distribution can be an exercise in futility (side note: this is why I am extremely indifferent to frequentist-vs-Bayesian arguments; being a hardliner in either camp implies to me that you enjoy knocking down straw-man arguments).
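To make the "augment by your prior" step concrete, here is a rough sketch. It is my own illustration, not anything from the fraud analysis: it combines an invented prior with the Sellke-Bayarri-Berger bound, which caps how much evidence a p-value can possibly carry against the null.

```python
import math

def posterior(prior, bayes_factor):
    """Posterior P(H1 | data) from prior P(H1) and a Bayes factor
    (evidence ratio) favoring H1 over H0, via Bayes' rule on the odds."""
    odds = (prior / (1 - prior)) * bayes_factor
    return odds / (1 + odds)

def max_bayes_factor(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor implied by a
    p-value (valid for p < 1/e): 1 / (-e * p * ln p). This is the *most*
    evidence against H0 that the p-value can carry."""
    return 1.0 / (-math.e * p * math.log(p))

# p = 0.03 against a hypothesis you consider nearly impossible (prior 0.001):
# even the most charitable reading of the p-value leaves it very improbable.
print(posterior(0.001, max_bayes_factor(0.03)))

# p = 0.23 for a hypothesis you already find very likely (prior 0.9):
# here p > 1/e, so the data are roughly uninformative and the posterior
# stays near the prior (Bayes factor taken as 1 for illustration).
print(posterior(0.9, 1.0))
```

The point is not the exact numbers but the asymmetry: a p-value of 0.03 carries at most a modest evidence ratio, which cannot rescue a hypothesis you gave near-zero prior probability.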
How does this apply to the "more sophisticated analysis" question? Any time you are deciding whether to run a more sophisticated analysis, you need to make an honest assessment of your prior belief in how much more appropriate the sophisticated model is compared with the "naive" model. In the voter-irregularities case, my priors are as follows:
1.) P(Nate Silver's confounding factors are real) = high.
The idea that socio-economic status of a county is correlated with technological level of polling in that county is a no-brainer. Furthermore, that socio-economic status is correlated with political preference is also a no-brainer.
2.) P(Linear regression is so inappropriate that it would lead to completely incorrect results) = very low.
Linear regression can be annoyingly robust (annoying if you're a research statistician trying to justify new statistical methodology). We all know that linear regression can fail miserably if an effect is moderately non-linear and we are making predictions far from the center of the covariate space. Similarly, if the variance is non-constant, we can have invalid standard errors. Neither of these should be of high concern in this case! Note that the variable of interest is a dummy variable ("used electronic polling"), so it cannot have a non-linear effect itself. If we wanted to get really paranoid, we could say that even though our covariate of interest does not suffer from non-linearity, the other covariates might, and a misspecified adjustment could leak bias into the "used electronic polling" coefficient. But I believe this worry is fairly foolish: linear regression is quite robust to mild non-linearity, and one would expect these covariates to have near-linear effects. Non-constant variance is a more legitimate concern, as I would guess the sample sizes of the counties are far from equal. Note, however, that this would not change the final inference: under non-constant variance the estimate is still unbiased, but the correct standard error would likely be higher than the nominal one. Given that the estimated effect is extremely close to zero (0.01 standard deviations away from zero, for perspective), fixing up the standard errors could hardly turn an estimated effect of 0.015 into a large piece of evidence.
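To put a number on that last claim, here is a quick back-of-the-envelope check. The values are illustrative: the effect of 0.015 is the figure discussed above, while the nominal standard error of 1.5 is invented so that the estimate sits about 0.01 standard errors from zero.

```python
from statistics import NormalDist

def two_sided_p(effect, se):
    """Two-sided p-value for a normally distributed estimate."""
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: effect of 0.015 as discussed above; a nominal
# standard error of 1.5 invented so the z-statistic is about 0.01.
effect, nominal_se = 0.015, 1.5

# Even if the nominal SE were off by a large factor in either direction,
# the p-value never comes anywhere near conventional significance.
for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    p = two_sided_p(effect, nominal_se * factor)
    print(f"SE scaled by {factor:>4}: p = {p:.3f}")
```

An estimate this close to zero stays far from significance under any plausible correction; the standard errors would have to shrink by roughly two orders of magnitude before the evidence changed character.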
So from an informal Bayesian perspective, I would need a lot of evidence from a more sophisticated model to be convinced that we have good evidence of the electronic polls being hacked. And keep in mind that as researchers try more and more "sophisticated" models until they find the results they want, I would put lower and lower prior probability on the next "sophisticated" model being the appropriate one.
If this doesn't bother you, go ahead and apply more and more sophisticated models. Just be warned that you're on a fishing expedition.