By Fester:
During the past couple of weeks, most of my time at work has been consumed by a massive and hideous data set that promises some really interesting questions once I have made it tractable. This is a beast of a data set: several tens of thousands of variables at multiple time points, and a few million pieces of information. It has been a massive pain in the ass.
One of the annoying portions of this task is that I have to keep my co-workers away from the data set until we are ready with a detailed analysis plan. This is to make sure that we are investigating pathways that make sense. For instance, it makes no sense for us to investigate Individual Favorite Color and Individual Height; these two (made-up) variables may or may not be related in a statistically significant way, but answering that question leads us nowhere.
Another chunk of this task is deciding what p-values will serve as our critical levels. This is important because if we ran a massive X thousand by X thousand correlation table and assumed a critical p-value of .05 (so that a relationship this strong would appear by random chance only 5% of the time), we should see thousands of 'significant correlations' that are due to pure chance, since roughly 5% of all the pairs we test will clear that bar on noise alone.
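Here's a quick sketch of what I mean, in Python (NumPy and SciPy assumed, and the data is invented); every variable below is pure random noise, yet a pile of pairs still come up 'significant' at .05:

```python
# Pure-noise demonstration of the multiple comparisons problem:
# every variable is random, so every "significant" correlation is a fluke.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_vars = 200, 100              # 100 variables -> 4,950 pairwise tests
data = rng.normal(size=(n_obs, n_vars))

n_tests = 0
false_positives = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        n_tests += 1
        if p < 0.05:
            false_positives += 1

print(f"{false_positives} of {n_tests} tests came up 'significant' at .05")
# Expect roughly 5% of 4,950, i.e. around 250 spurious hits.
```

And that is with a mere 100 variables; scale it up to tens of thousands and the pile of flukes gets enormous.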
Okay, let me back up for a second and talk a bit about p-values. A statistical analysis produces a lot of output, but if you need to go quick and dirty and look at just one thing, the p-value is what you check to decide whether the analysis is worth pursuing any further. The p-value tells you (roughly) the probability that the pattern being tested could have shown up through nothing but random chance. The higher the p-value, the more likely the pattern is due to chance or sampling error; the lower the p-value, the less likely it is due to chance. This is a massive simplification but it is close enough --- the lower the p-value, the more likely that something "interesting" is going on in the data.
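If you want to see one in the wild, here is a minimal example (again assuming Python with SciPy, on made-up data): two samples drawn from the exact same distribution, so any difference between them is pure sampling error.

```python
# Reading a p-value: compare two samples that come from the SAME
# distribution, so the true difference between them is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.0, scale=1.0, size=50)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A high p-value says the data is entirely consistent with chance;
# a low one hints that something "interesting" may be going on.
```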
There are some relationships where .05 is good enough, but in most cases p = .025 or .01 will be needed before the team really feels confident that there is something actually going on and not just noise in the data set. There may be a couple of cases where we will demand p = .001 before we are truly confident, but those will be rare. These decisions will exclude some valid but weak results, but they will weed out plenty of bullshit results. We are demanding a higher level of confidence because the data set is a monster, and we know that we could generate plenty of 'significant' results that don't mean much beyond being artifacts of the data if we went with the baseline academic standard of significance at p = .05.
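A formal version of this tightening is a multiple-comparisons correction. Here is a hedged sketch using the Bonferroni adjustment from statsmodels (one standard option, not necessarily what my team will settle on):

```python
# Bonferroni correction: divide the .05 cutoff by the number of tests.
# Under a true null hypothesis, p-values are uniform on [0, 1], so
# random uniforms stand in for 10,000 tests run on pure noise.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(size=10_000)

raw_hits = (p_values < 0.05).sum()
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

print(f"raw .05 cutoff: {raw_hits} 'significant' results on pure noise")
print(f"Bonferroni:     {reject.sum()} significant results")
# The raw cutoff flags ~500 flukes; the corrected threshold
# (.05 / 10,000) flags essentially none of them.
```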
If I'm looking at a series of potential relationships within this data set that cannot produce a p-value of less than .125, I'm laughing, as there is nothing statistically there. I can write the relationship off as due to random chance.
Well, the right wingers cannot, as they are producing a highly misleading "analysis" asserting a statistically significant correlation that suggests the Chrysler dealerships are being closed as a Clinton retribution scheme. Not an Obama scheme, but a Clinton scheme! Besides using a tool that is way more complex than it needs to be, the regression they produced showing a "Clinton effect" had a p-value of .125, at which point they claimed initial significance.
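To show how weak p = .125 is, here is a hypothetical simulation (my own sketch in Python with statsmodels; the 'donor' and 'closed' variables are invented, not the data from their analysis): regress pure noise on pure noise over and over, and about one run in eight clears that bar.

```python
# How often does pure noise "pass" a .125 significance bar?
# Both variables below are random and unrelated by construction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_sims, hits = 1_000, 0
for _ in range(n_sims):
    donor = rng.integers(0, 2, size=300)    # invented predictor
    closed = rng.integers(0, 2, size=300)   # invented outcome
    result = sm.Logit(closed, sm.add_constant(donor)).fit(disp=0)
    if result.pvalues[1] <= 0.125:
        hits += 1

print(f"{hits / n_sims:.0%} of pure-noise regressions hit p <= .125")
# Roughly 12-13% of runs clear the bar on noise alone.
```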
There is nothing there, statistically. And even if the p-value for any of their hypotheses was less than or equal to .05, that is just the start of a good explanation, or of a scandal if that is what they are chasing. The next step would be to take the statistical result as the initial scent and push forward to find a coherent and logical mechanism with tangible evidence of a deliberate action. Running a few statistical tests and proclaiming victory without a working model that produces those results at a significant p-value is just damn lazy work.
Jeez Fester, were you away from the internet for a while and forgot to hit the "post" button on a bunch of saved posts? Spread 'em out a little, eh!
On an editorial note, many of us are not too current in our statistical modeling, if we ever were, and don't have the foggiest clue as to what p-values are. I can get the gist of what you mean from the context, but a brief explanation of what the term means would greatly enhance the readability of the post, IMHO.