Mean-centering Does Nothing for Multicollinearity!
I called my wife stupid yesterday, and I have yet to take it back and don’t think I will. Let me explain why.
There is a problem called multicollinearity:
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
Wikipedia incorrectly refers to this as a problem “in statistics”. It is a statistics problem in the same way a car crash is a speedometer problem. Multicollinearity is actually a life problem and statistics measures how bad it is.
An example of multicollinearity that we’ve talked about here at OT is the effect of parental income on the incomes of their grown children. Parental income is likely to correlate with several things (the quality of the schools the student is likely to attend, accessibility of books and other resources, supplemental activities available, inherited student IQ, etc.). Without careful analysis, it can be difficult to determine what is actually leading to the children’s high incomes. Most just choose whichever explanations best complement their political worldviews.
I repeat: multicollinearity is not a statistics problem. It is a problem that has existed since before there were numbers. You won’t know whether the rooster crows because of the light or because it becomes warmer until you see whether it crows on a morning when the sunrise is accompanied by the arrival of a cold front. Until then, it could be one or the other waking the rooster.*
One of the things people do is use numbers to measure their problems. For multicollinearity, the standard metric is the variance inflation factor (VIF).
In general, high VIFs indicate you have significant multicollinearity problems while low ones indicate you might not. The measure is not perfect, and despite what you might read out there, there is no cut-off value for what is considered acceptably low.
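For readers who have not met the measure before, the usual textbook definition is short. The symbols below are notation introduced here for illustration: x_j is the j-th predictor and R_j^2 is the R-squared from regressing x_j on all the other predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A predictor that the others cannot explain at all has R_j^2 = 0 and a VIF of 1. As the other predictors explain it better and better, R_j^2 approaches 1 and the VIF grows without bound.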
Then in 1991, Leona Aiken and Stephen West screwed everything up. They weren’t the first. Jaccard, Wan, and Turrisi (1990) made the same screw-up. It was the 90s, I guess.
What they observed was that if you take the predictor variables and mean-center them, then the VIFs tend to go down. Mean-centering is where you subtract the average from each of the data points.
Paint the picture in your mind now. You are trying to figure out why kids of rich parents end up with higher incomes. You’ve narrowed it down to either parental IQ inherited by children or parents reading books to their kids more. But almost all the parents you talked to with high IQ also read books to their kids and none of the parents with low IQ read books to their kids, so you can’t confidently say which one it is.
Then, you read Aiken and West and follow their recommendations. You subtract the mean IQ of 100 from each of the IQ scores. (So an IQ of 90 is now -10 and an IQ of 110 is now +10.) Also, you subtract the mean books read per month of 3 from each of the books-read scores. (So reading 0 books per month is now -3 and reading 7 books per month is +4.)
Then you calculate your VIFs and they indeed went way down. Have you solved your multicollinearity problem?
No. Changing the scale of IQ and number of books read didn’t actually give you more or better data. The root problem of your not having any high-IQ parents who didn’t read or any low-IQ parents who did read is still there. All you did was succeed in changing the metric used to measure the problem: the VIFs.
This is the equivalent of trying to reduce the severity of a car accident by switching your speedometer from miles per hour to nautical miles per hour. Your numbers will change to sound acceptably lower, but you are still in exactly the same situation you were in before. Changing the numbers used to describe a problem doesn’t change the problem.
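A minimal sketch of the point, using simulated data rather than any real survey. The variable names, the made-up link between IQ and books, and the inclusion of an IQ × books interaction term are all assumptions for illustration; the interaction term is there because that is the setting in which centering visibly moves the VIFs.

```python
import numpy as np

def vif(X):
    """VIF of each column of X; every auxiliary regression includes an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated parents: books read per month tracks IQ closely (hypothetical numbers).
rng = np.random.default_rng(0)
n = 500
iq = rng.normal(100, 15, n)
books = 3 + 0.1 * (iq - 100) + rng.normal(0, 0.5, n)

# Same model twice: raw predictors, then mean-centered predictors.
raw = np.column_stack([iq, books, iq * books])
iq_c, books_c = iq - iq.mean(), books - books.mean()
centered = np.column_stack([iq_c, books_c, iq_c * books_c])

vif_raw = vif(raw)
vif_centered = vif(centered)
corr_raw = np.corrcoef(iq, books)[0, 1]
corr_centered = np.corrcoef(iq_c, books_c)[0, 1]

print("VIFs, raw      (iq, books, iq*books):", vif_raw)
print("VIFs, centered (iq, books, iq*books):", vif_centered)
print("corr(iq, books) raw vs centered:", corr_raw, corr_centered)
```

In this sketch the VIF on the interaction term falls sharply after centering, while the correlation between IQ and books is the same number before and after. The parents in the sample have not changed; only the yardstick has.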
Science has propagated this error far and wide. Aiken and West (1991) now have 25,769 scholarly citations according to Google. That is breathtaking.
Here is a gated paper from the May issue of Strategic Management Journal, which is in my humble opinion the best business journal to actually read. The authors, who are each no doubt talented scholars, nevertheless make the same error as those before them:
We mean centered predictor variables in all the regression models to minimize multicollinearity (Aiken and West, 1991). The variance inflation factors for all independent variables were below the recommended level of 10.
These are smart people doing something stupid in public. They have reified a statistical measure of multicollinearity and mistaken it for actual multicollinearity. They should each feel bad, but they are not close to being unique. I pick on them only because they are recent and in a prestigious journal I have immediate access to.
Contrary to Troublesome Frog’s claim in a comment a couple of years ago, academics reify concepts all the time. It is entirely likely and plausible that an economist may spend so much time and effort and passion thinking about how to increase employment that he has forgotten what employment was supposed to be good for in the first place. That’s exactly the kind of mistake whose likelihood increases with the amount of time and enthusiasm spent working on a problem.
I read Chris’s series on the Michael LaCour scandal with interest. That error, once discovered, seems to have been dealt with admirably by all involved.
But what do you do about Aiken and West (1991)? Their work has the momentum of a thousand suns. Raj Echambadi and James Hess wrote an article titled quite clearly “Mean-Centering Does Nothing for Moderated Multiple Regression” in the highly regarded and widely read Journal of Marketing Research:
…we show the following: 1) in contrast to Aiken and West’s (1991) suggestion, mean-centering does not improve the accuracy of numerical computation of statistical parameters, 2) it does not change the sampling accuracy of main effects, simple effects, and/or interaction effects (point estimates and standard errors are identical with or without mean-centering), and 3) it does not change overall measures of fit such as R² and adjusted-R². It does not hurt, but it does not help, not one iota. [Vik: emphasis added]
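The claim in the quote can be checked on simulated data. This is a sketch under assumptions, not Echambadi and Hess’s own code: the data-generating numbers, the variable names, and the small `ols` helper are all made up for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS with an intercept, solved via QR. Returns coefficients, standard errors, R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    Q, R = np.linalg.qr(A)
    beta = np.linalg.solve(R, Q.T @ y)
    resid = y - A @ beta
    sigma2 = (resid @ resid) / (n - A.shape[1])
    R_inv = np.linalg.inv(R)
    # diag of (A'A)^-1 equals the row sums of squares of R^-1
    se = np.sqrt(sigma2 * np.sum(R_inv ** 2, axis=1))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(100, 15, n)
x2 = 3 + 0.1 * (x1 - 100) + rng.normal(0, 0.5, n)
y = 10 + 0.2 * x1 + 1.5 * x2 + 0.05 * x1 * x2 + rng.normal(0, 5, n)

# Moderated regression with raw predictors, then with mean-centered predictors.
beta_raw, se_raw, r2_raw = ols(np.column_stack([x1, x2, x1 * x2]), y)
c1, c2 = x1 - x1.mean(), x2 - x2.mean()
beta_cen, se_cen, r2_cen = ols(np.column_stack([c1, c2, c1 * c2]), y)

# Index 3 is the interaction term (index 0 is the intercept).
print("interaction coefficient:", beta_raw[3], beta_cen[3])
print("interaction std. error: ", se_raw[3], se_cen[3])
print("R^2:                    ", r2_raw, r2_cen)
```

The interaction coefficient, its standard error, and R² agree between the two fits up to floating-point error. The lower-order coefficients do print differently, but only because after centering they describe the effect of one variable when the other sits at its mean; the fitted values are the same.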
That was in 2004, and it was to no avail. Aiken and West (1991) shows no sign of slowing down. Correct information in this case doesn’t appear to displace incorrect information.
This should not be surprising. No one will particularly miss LaCour. He didn’t make anyone’s life easier. Aiken and West (1991), in contrast, is a very useful tool in every empiricist’s toolbox. Why would researchers throw away something that helps them get published? For that, Echambadi and Hess would have to publish what to do instead, and it’d have to be at least as easy as Aiken and West’s ineffectual approach.
The dirty secret is that no one is really interested in whether their data really has multicollinearity or not. Why would people invest a lot of time and effort trying to identify problems in their own data? They just want a way to get past all the hoops so they can report their results.
In this way, I differ from Chris, who seems to see great danger in the search for sexy results. This critique, in my view, is difficult to differentiate from “stop looking at problems that matter to people!” He says “it creates perverse incentives for researchers”, but these are the exact same incentives that convince researchers to work on anything meaningful. And what really should consumers of research do? Should we not pay attention to articles that are interesting and have implications for society? That seems as unwise as it is untenable.
What worries me isn’t a march towards sexiness. It’s a march towards publishing irrespective of whether someone actually believes what they are publishing is true or not. We have a system that rewards statistical significance rather than honesty. We do have checkboxes in place that are designed to keep out those seeking to hack their way into a journal. The most famous is the requirement that p values be less than 0.05. Almost everyone, however, seems to regard this as a bureaucratic hoop rather than a way to sort between true and false claims. If you don’t like your results on the first run, add or subtract control variables until you do. If that doesn’t work, there are a dozen other options available.
Similarly, checks for measurement reliability, validity, and multicollinearity are made only because they are a burden to be borne. They are executed by people thinking “how do I pass this validity test?”, not by those thinking “does my data pass this validity test?” If you run into problems, you search for a different test that you can pass or do different manipulations until you get through. In fields where collecting data can take years, not publishing work from a dataset is not an option. In my view, this is a greater threat to science than data fabrication.
* Or an internal clock. Or a temperature change whether it is in a positive or a negative direction. I don’t really know much about roosters.
I don’t see multicollinearity as a problem to begin with.
In your example, high child income is caused by reading books to one’s kids is caused by high parental IQ, or high child income and reading books to one’s children are both caused by high parental IQ. To truly suss out causality, you would have to run a second experiment or second observational study that controls for one or the other variable.
I’m with you on hard validity requirements, such as p-values, being quite silly at times, but there is already a surplus of journals and a surplus of submissions to each. The problem in medical research at least, in my experience, stems from the fact that journal reviewers seldom know anything at all – never mind even the basics – about biostatistics and epidemiology.
If you’re running experiments, multicollinearity is definitely not a problem. It’s a problem when you’re using regression, which you don’t really need with experiments.
@chris
Multicollinearity isn’t even really a problem with regressions – all of your estimates will be unbiased, consistent and efficient even in the presence of multicollinearity. It does make inferring causation difficult, but inferring causation from a regression is a dicey proposition at the best of times.
It’s a problem for regression because it actually screws with the coefficients, sometimes dramatically, making it difficult or, with massive collinearity, impossible to interpret them.
@chris
Multicollinearity doesn’t really screw up the coefficients since the regression process will still generate correct estimates for your coefficients and their variances. It certainly can make it difficult to work out which of your variables is doing the heavy lifting, but the way I usually deal with that is to remove one variable at a time from the model, and note what the effect on overall model fit is.
This isn’t true. Assuming you’re doing least squares linear regression, yes, you will get the minimum variance, most efficient, unbiased estimator for each coefficient, but those terms are illusions of correctness. If you work with a dataset with a lot of multicollinearity and add or remove variables, the coefficients can jump around dramatically, changing sign in the process while still being significant in one direction and then in the other. Each of those estimates is “unbiased”, but at most only one can be correct and maybe not even that.
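A small simulation gives a feel for the sensitivity being described. It is a sketch on made-up data (two nearly identical predictors, hypothetical names `x1` and `x2`), and it only shows how much the estimate for one variable depends on whether its near-twin is in the model; it does not reproduce the sign-flipping case specifically.

```python
import numpy as np

def ols(X, y):
    """OLS with an intercept, solved via QR. Returns coefficients and standard errors."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    Q, R = np.linalg.qr(A)
    beta = np.linalg.solve(R, Q.T @ y)
    resid = y - A @ beta
    sigma2 = (resid @ resid) / (n - A.shape[1])
    R_inv = np.linalg.inv(R)
    se = np.sqrt(sigma2 * np.sum(R_inv ** 2, axis=1))
    return beta, se

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)       # x2 is almost a copy of x1
y = x1 + x2 + rng.normal(0, 1, n)      # both matter equally in truth

beta_both, se_both = ols(np.column_stack([x1, x2]), y)
beta_x1, se_x1 = ols(x1.reshape(-1, 1), y)

print("y ~ x1 + x2: coef on x1 =", beta_both[1], "+/-", se_both[1])
print("y ~ x1     : coef on x1 =", beta_x1[1], "+/-", se_x1[1])
```

With both variables in the model, the standard error on x1 is many times larger than when x2 is dropped, so the point estimate can land almost anywhere. With x2 dropped, x1 quietly absorbs the effect of both.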
Is there anything in statistics – particularly regression analysis – that isn’t just an illusion of correctness? Or, for that matter, the entire notion that there is a correctness out there to be discovered through regression analysis of all techniques?
I’m not sure what correctness means here. Regression is meant to estimate a few dimensions of interrelatedness between variables, and the “correctness” of those estimates is a mathematical question on the one hand (that is, depending on the type of regression, it is a question of how one determines the “fit” of the estimates to the actual data mathematically) and an empirical one (do the estimates fit with observations, theory, that sort of thing, and when appropriate, do they predict new observations or provide testable hypotheses that are borne out by future observations).
Correctness, aside from the methodological/mathematical issues (which, in statistics, can also supervene on empirical ones), is looked at the same way it is in science generally, when doing regression.
You and I have the same understanding of “correctness”.
I didn’t know you were a postmodernist!
To me, parameters are correct if they come somewhat close to describing the real-world relationships they are supposed to be estimating. Assuming there is a real world, of course.
We can indeed argue that everything in statistics is an illusion, but some illusions are more damaging than others. In the case of a model that has misspecification errors, saying “these are guaranteed to be unbiased, most-efficient estimators” is a bigger illusion than some others.
Where “illusion” is defined by what’s likely to get you in trouble practically speaking.
I’d actually consider myself more of a Wittgensteinian, and I’m a methods guy when it comes to research. I would resent your sarcasm if I didn’t find its underlying ignorance so amusing.
As for the postmodernists, I value their critique in that it forces me to think critically about the validity and accuracy of my study.
Regression analysis on the other hand is the squishiest technique that exists in statistics, which is a language that attempts to describe experience using mathematical tautologies instead of verbal ones. Nor is regression analysis focused on any sort of “correctness”, unless by “correctness” you really just mean “doing stuff better”. Finally, regression analysis is not capable of establishing any kind of causal relationship.
Can you improve on models? Yes, but it depends on what your purpose is. One of the goals of regression is to establish independently-associated factors, so in that sense, multicollinearity is something to be reduced or eliminated, but only inasmuch as you can actually do something with what remains.
I had remembered that after I posted. I’ve never managed to figure out what he was saying or even approximate the list of what he was saying, so I’m totally willing to believe you understand something I don’t here.
Nor is regression analysis focused on any sort of “correctness”, unless by “correctness” you really just mean “doing stuff better”.
Well, I’d certainly be willing to settle for the latter. I’m not a fan of saying “well, all of this stuff is not ‘correct’ anyway, so there’s no point in bothering.”
@vikram-bath
But the reason the parameters are jumping around is because the multicollinearity is creating actual uncertainty in the parameters. As for working out which of your seemingly-significant model specifications is the best, that’s what measures of overall fit are for.
Me: Multicollinearity is a problem because it gives you terrible estimates!
You: But those estimates are only terrible in a way that reflects the data!
At this point, I think we’re only disagreeing about what constitutes a problem.
I think you’re right. When I think of a problem for a regression, I think of things that stop OLS working as the best estimator, like endogeneity or non-stationarity.
But multicollinearity is just like having a small sample size. It’s not great, but you can’t do anything about it and your techniques will still work as well as anything, so what can you do?
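One way to read the small-sample comparison is through the textbook expression for the sampling variance of an OLS coefficient under homoskedastic errors. The notation is introduced here for illustration: sigma^2 is the error variance, x_ij is observation i of predictor j, and R_j^2 is the R-squared from regressing predictor j on the other predictors.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2}
```

A small sample shrinks the sum in the first denominator; collinearity inflates the second factor, which is the VIF. Either way the estimate is still unbiased, just less precise, which is why the two situations feel alike.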
…so what can you do?
Go back and think about the underlying problem some more? I always regard multicollinearity as a big red warning flag that I’m doing something badly. Why are these six variables (to choose a number at random) so highly correlated? Which one is fundamental and which ones are secondary? Are they all surrogates for something fundamental that I haven’t considered?
Our current publishing standards do fall short of “reject anything that uses regression”.
That’s very progressive of you.
I agree with almost everything you say here (except that multicollinearity is a statistical problem as well, in that it makes it difficult to get meaningful statistical results). I’ll just reiterate that I think sexiness for sexiness’ sake is a problem not because it gives the consumers of science what they want, but because it produces an incentive not just to study what they want to hear about, but to produce flashy results so that they’ll stand out. Sexiness isn’t a topic issue, it’s a results issue, and it is one made possible in large part by the focus on statistical significance in publishing.
“Science has propagated this error far and wide. Aiken and West (1991) now have 25,769 scholarly citations according to Google. That is breathtaking.”
Is this the case though? And to be clear, I honestly don’t know.
If means-centering is simply a way to measure — a la, nautical miles as opposed to land-lubber miles — then it only becomes a fallacy if the out-coming number is treated as a special solution rather than a simple measurement. Otherwise, it’s like saying that an experiment that used ounces and cups rather than milliliters and liters is going to give you flawed results.
Those near-26K citations, are they papers that rely on means-centered measurements to add special and undeserved meaning to the results, or are they merely citing the measurement system they are using?
There are actual methodological and even theoretical reasons for centering (not necessarily, but often mean centering) variables. If you’re using a linear regression model, what the intercept means depends on how the variables are centered, so to the extent that the intercept is important, centering can be.
However, centering to avoid multicollinearity issues is problematic because it doesn’t, as Vikram points out, change the fact that your two variables are highly correlated. You haven’t changed the predictive value of those variables just by subtracting a constant, unless there is some theoretical reason to believe that constant is important in the relationship between the variable(s) and whatever they’re supposed to be predicting.
If you have an experiment that uses centering, then, do others later verify it with centering as well, or do they just scrap it and pretend it doesn’t exist on the basis that it never should have been run in the first place?
Again, I’m asking this as someone who doesn’t know.
It would depend on a lot of things, but presumably if my results are important enough to replicate, I’ve centered for a good reason, and anyone using similar methods to replicate my results would have the same reason.
I’ll give an example: Imagine we have a model with two variables: a binary variable for gender, in which 1=female and 0=male, and a continuous variable, height in inches. If we don’t center height, then the intercept (the predicted value when every predictor is 0, so with gender=0) represents males who are 0 inches tall. This would make no sense, so we would center the variable somehow, probably on the average height.
If I then got results showing that both gender and height influence pizza consumption in a certain way, then anyone who replicated my results would probably want to similarly center their height variable.
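The pizza example can be sketched on simulated data. Everything numeric here is invented for illustration (the heights, the slice counts, the variable names); the point is only what happens to the intercept and the slopes when height is centered.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
female = rng.integers(0, 2, n)                      # 1 = female, 0 = male
height = rng.normal(69, 3, n) - 5 * female          # inches (made-up distribution)
slices = 1.0 + 0.05 * height - 0.5 * female + rng.normal(0, 0.5, n)

def fit(height_col):
    """Least-squares fit of slices on an intercept, gender, and a height column."""
    A = np.column_stack([np.ones(n), female, height_col])
    coef, *_ = np.linalg.lstsq(A, slices, rcond=None)
    return coef                                     # [intercept, female, height]

raw = fit(height)
centered = fit(height - height.mean())

print("raw:      intercept, female, height =", raw)
print("centered: intercept, female, height =", centered)
```

The gender and height slopes are the same in both fits. Only the intercept moves: in the raw fit it is the prediction for a male who is 0 inches tall, and in the centered fit it is the prediction for a male of average height.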
Tod,
Someday, I will learn my own lesson that all analogies suck.
Mean centering does not hurt. Like you said, you can do all your calculations in nautical miles, and it might even make sense to do that sometimes.
It isn’t bad to mean center. It is just incorrect to think or claim that mean-centering alleviates any of your real-life problems. The SMJ paper I cited that said “We mean centered predictor variables in all the regression models to minimize multicollinearity” is flat out wrong about what mean-centering does, and the fact that they did it and it got published in a great journal just means it will be all the more likely that others will make the same mistake in the future. Someone will have a problem and think to look and see what these people did and then do that.
Not all of the 26,000 citations are wrong. Some, like Echambadi and Hess (2004), cite Aiken and West (1991) specifically to counter what they say. Additionally, Aiken and West did say other things, so some of those citations might be for those other things. In general though, I think most of the citations are inappropriate.
I question the ubiquity of the problem you’ve identified Vikram. For one thing, I’ve never heard of either the Variance Inflation Factor, or the method used to “fix” it. This may be because anyone who understands how covariance is calculated should be able to tell immediately that it is mathematically impossible for adding a constant to your variable to have any effect on its covariances. The advice I was given for fixing multicollinearity was to not bother.
As for reifying concepts, I can tell you in economics there are many instances where steps are taken to avoid that problem. For example, every economist is taught the ways that indicators like GDP and employment can increase but in ways that are bad for the economy or society. I see more reification of economic concepts outside of economics than inside it.
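The covariance point in the comment above takes one line to verify. Here c is any constant, such as a mean being subtracted, and X and Y are any two variables; the notation is introduced here for illustration.

```latex
\operatorname{Cov}(X + c,\, Y)
  = E\big[(X + c - E[X + c])\,(Y - E[Y])\big]
  = E\big[(X - E[X])\,(Y - E[Y])\big]
  = \operatorname{Cov}(X, Y)
```

The constant cancels inside the first bracket because E[X + c] = E[X] + c. This covers shifting a variable itself; a product term built from shifted variables is a different variable, so the same argument does not carry over to it.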
That’s not bad advice. “Collect more data” would be better though.
I can agree with this…if for no other reason than how much day-to-day news the media puts out concerning the stock market. But even this is, I think, an example of people who spend more time with a metric being more likely to reify it.
@vikram-bath
I think the problem of reification is most acute with people who use something a lot, but lack a deep understanding of it. I know exactly how GDP is calculated, so I have a good handle on what it can and can’t do. People who don’t have that knowledge will be more likely to think of it as a black box, rather than a collection of parts.
Agreed wholeheartedly.
Virtually any statistical package will output VIFs, condition indices, and/or tolerance (1 − R²) as standard measures of collinearity.
On the economist front, I agree. I’ve seen a lot of people make the mistake referred to in that link, but I’ve never seen an economist make it. I’ve seen that mistake ascribed to economists who aren’t really making it, though.
Well, that’s why I lost my temper and called my wife stupid. I share your sense of “obviousness” about the whole matter. Yet, Aiken and West got published by Sage, which people (including me) still regularly use as a go-to resource for their empirical problems. And since then thousands of researchers, many of whom are much smarter, better published, and probably better looking than me have been duped by this idea that seems obvious without too much reflection needed.
And as I mentioned, the worst bit is that it doesn’t seem correctable.
This is true if all you’re doing is dealing with two independent variables, but the mean-centering is used for interaction terms, of which the mean-centered variable will be a constituent. It’s pretty easy to see, mathematically, how this does affect covariance. That’s not the issue. If it was, no one would use this method, because these are not stupid people.
Sorry, yes, the interaction covariances change. When it comes to reducing multicollinearity (or even changing the point estimates of the regression), these changes are simply offset elsewhere. From Echambadi and Hess (where “x1x2” is an interaction term):
Oh right, I get that. That is less obvious than the simple mathematical relationship, though, which is why even statisticians still mistakenly use it.
@chris
Why would you worry about multicollinearity between a variable and its interaction terms? In that case, the interpretability problem can be straightforwardly solved by reporting partial derivatives instead of coefficients.
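One possible reading of the partial-derivative suggestion, offered as an interpretation rather than the commenter’s own worked example. The model and symbols below are assumed for illustration: two predictors x_1 and x_2 with an interaction term.

```latex
E[y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
\qquad\Longrightarrow\qquad
\frac{\partial\, E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_3 x_2
```

The effect of x_1 is not a single number but depends on where x_2 sits. Evaluating the derivative at the mean of x_2 gives beta_1 + beta_3 times that mean, which is the coefficient a mean-centered fit reports for x_1. On this reading, centering and reporting the derivative at the mean are two routes to the same quantity.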
I admit I don’t know how that will help. Could you flesh it out?
I’ve run into collinearity issues in my own modeling work mostly when I have sample size issues. Say I am looking at location and IQ, for example, and using dummies for each of several locations. I only have a few people from Capitol Hill, and they’re all idiots, so I can’t get any meaningful coefficients for Capitol Hill. But tossing the location is dicey for methodological reasons, if nothing else (it’s worse if Capitol Hill is the location I care about, obviously).
This is less a statistical problem than a data collection problem, but if I’m using regression, it’s likely I had nothing to do with the data collection, and I’m stuck with what they give me.
@chris
That is definitely a data collection problem; there’s not really anything you can do about that from a methodological point of view.
And you left out the parents’ having the contacts to get their kids into good prep schools and colleges and find them good jobs because …?
It probably works against my political worldview!
—
Since it was just an example, I just wanted it to only have two independent variables.
Not sure of the specific study, if any, you had in mind, but doesn’t the effect go much further down the SES scale than is consistent with that being the primary explanation?
I didn’t have any particular study in mind.
I think it depends on what Mike considers as “contacts”. Knowing anyone who is considered to work in the professions could be considered a “contact”. Though I agree the way he put it, it sounds like knowing people who can just get someone a job or college admission at their suggestion.
Or at least materially affect their kids’ chances of success by knowing who to talk to.
Though, at a lower SES level, having been through the admissions process yourself and having friends whose kids went through it recently help as well.
I’m impressed that there are 40 comments (mostly) on the subject of multicollinearity.
What worries me is not what gets published, but what doesn’t.
So much research that you don’t even hear about.