Mean-centering Does Nothing for Multicollinearity!

Vikram Bath

Vikram Bath is the pseudonym of a former business school professor living in the United States with his wife, daughter, and dog. (Dog pictured.) His current interests include amateur philosophy of science, business, and economics. Tweet at him at @vikrambath1.

44 Responses

  1. Christopher Carr says:

    I don’t see multicollinearity as a problem to begin with.

    In your example, high child income is caused by reading books to one’s kids is caused by high parental IQ, or high child income and reading books to one’s children are both caused by high parental IQ. To truly suss out causality, you would have to run a second experiment or second observational study that controls for one or the other variable.

    I’m with you on hard validity requirements, such as p-values, being quite silly at times, but there is already a surplus of journals and a surplus of submissions to each. The problem in medical research, at least in my experience, stems from the fact that journal reviewers seldom know anything at all – never mind even the basics – about biostatistics and epidemiology.Report

    • Chris in reply to Christopher Carr says:

      If you’re running experiments, multicollinearity is definitely not a problem. It’s a problem when you’re using regression, which you don’t really need with experiments.Report

      • James K in reply to Chris says:

        @chris

        Multicollinearity isn’t even really a problem with regressions – all of your estimates will be unbiased, consistent and efficient even in the presence of multicollinearity. It does make inferring causation difficult, but inferring causation from a regression is a dicey proposition at the best of times.Report

        • Chris in reply to James K says:

          It’s a problem for regression because it actually screws with the coefficients, sometimes dramatically, making it difficult or, with massive collinearity, impossible to interpret them.Report

          • James K in reply to Chris says:

            @chris

            Multicollinearity doesn’t really screw up the coefficients since the regression process will still generate correct estimates for your coefficients and their variances. It certainly can make it difficult to work out which of your variables is doing the heavy lifting, but the way I usually deal with that is to remove one variable at a time from the model, and note what the effect on overall model fit is.Report
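
(A quick sketch of that drop-one-variable-at-a-time check, in Python with statsmodels; the data and variable names here are simulated purely for illustration.)

```python
# Fit the full model and then each reduced model, comparing overall fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
parent_iq = rng.normal(100, 15, n)
reading = 0.9 * parent_iq + rng.normal(0, 5, n)          # highly correlated with parent_iq
income = 0.5 * parent_iq + 0.1 * reading + rng.normal(0, 10, n)

def fit(predictors, label):
    X = sm.add_constant(np.column_stack(predictors))
    res = sm.OLS(income, X).fit()
    print(label, "R2 = %.3f" % res.rsquared, " AIC = %.1f" % res.aic)

fit([parent_iq, reading], "both predictors")
fit([parent_iq],          "parent_iq only ")
fit([reading],            "reading only   ")
```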

            • Vikram Bath in reply to James K says:

              James K: the regression process will still generate correct estimates for your coefficients and their variances

              This isn’t true. Assuming you’re doing least squares linear regression, yes, you will get the minimum-variance, most efficient, unbiased estimator for each coefficient, but those terms are illusions of correctness. If you work with a dataset with a lot of multicollinearity and add or remove variables, the coefficients can jump around dramatically, changing sign in the process while still being significant first in one direction and then in the other. Each of those estimates is “unbiased”, but at most one can be correct, and maybe not even that.Report
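
(To illustrate the kind of instability Vikram is describing, here is a small simulation in Python with statsmodels; the variables are made up and nothing here comes from an actual study.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2.0 * x2 + rng.normal(size=n)

alone = sm.OLS(y, sm.add_constant(x1)).fit()
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Alone, x1 picks up x2's entire effect (coefficient near +2, small standard error).
# With x2 included, x1's coefficient collapses toward zero with a huge standard
# error, and its sign is essentially arbitrary from sample to sample.
print("x1 alone:    coef = %+.3f  se = %.3f" % (alone.params[1], alone.bse[1]))
print("x1 with x2:  coef = %+.3f  se = %.3f" % (both.params[1], both.bse[1]))
```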

              • Christopher Carr in reply to Vikram Bath says:

                Is there anything in statistics – particularly regression analysis – that isn’t just an illusion of correctness? Or, for that matter, the entire notion that there is a correctness out there to be discovered through regression analysis of all techniques?Report

              • Chris in reply to Christopher Carr says:

                I’m not sure what correctness means here. Regression is meant to estimate a few dimensions of interrelatedness between variables, and the “correctness” of those estimates is a mathematical question on the one hand (that is, depending on the type of regression, it is a question of how one determines the “fit” of the estimates to the actual data mathematically) and an empirical one (do the estimates fit with observations, theory, that sort of thing, and when appropriate, do they predict new observations or provide testable hypotheses that are borne out by future observations).

                Correctness, aside from the methodological/mathematical issues (which, in statistics, can also supervene on empirical ones), is looked at the same way it is in science generally, when doing regression.Report

              • Christopher Carr in reply to Chris says:

                You and I have the same understanding of “correctness”.Report

              • Christopher Carr: the entire notion that there is a correctness out there to be discovered through regression analysis of all techniques?

                I didn’t know you were a postmodernist!

                To me, parameters are correct if they come somewhat close to describing the real-world relationships they are supposed to be estimating. Assuming there is a real world, of course.

                We can indeed argue that everything in statistics is an illusion, but some illusions are more damaging than others. In the case of a model that has misspecification errors, saying “these are guaranteed to be unbiased, most-efficient estimators” is a bigger illusion than some others.

                Where “illusion” is defined by what’s likely to get you in trouble practically speaking.Report

              • Christopher Carr in reply to Vikram Bath says:

                I’d actually consider myself more of a Wittgensteinian, and I’m a methods guy when it comes to research. I would resent your sarcasm if I didn’t find its underlying ignorance so amusing.

                As for the postmodernists, I value their critique in that it forces me to think critically about the validity and accuracy of my study.

                Regression analysis, on the other hand, is the squishiest technique that exists in statistics, which is a language that attempts to describe experience using mathematical tautologies instead of verbal ones. Nor is regression analysis focused on any sort of “correctness”, unless by “correctness” you really just mean “doing stuff better”. Finally, regression analysis is not capable of establishing any kind of causal relationship.

                Can you improve on models? Yes, but it depends on what your purpose is. One of the goals of regression is to establish independently-associated factors, so in that sense, multicollinearity is something to be reduced or eliminated, but only inasmuch as you can actually do something with what remains.Report

              • Christopher Carr: Wittgensteinian

                I had remembered that after I posted. I’ve never managed to figure out what he was saying or even approximate the gist of what he was saying, so I’m totally willing to believe you understand something I don’t here.

                Nor is regression analysis focused on any sort of “correctness”, unless by “correctness” you really just mean “doing stuff better”.

                Well, I’d certainly be willing to settle for the latter. I’m not a fan of saying “well, all of this stuff is not ‘correct’ anyway, so there’s no point in bothering.”Report

              • James K in reply to Vikram Bath says:

                @vikram-bath

                But the reason the parameters are jumping around is because the multicollinearity is creating actual uncertainty in the parameters. As for working out which of your seemingly-significant model specifications is the best, that’s what measures of overall fit are for.Report

              • Vikram Bath in reply to James K says:

                James K: But the reason the parameters are jumping around is because the multicollinearity is creating actual uncertainty in the parameters.

                Me: Multicollinearity is a problem because it gives you terrible estimates!
                You: But those estimates are only terrible in a way that reflects the data!

                At this point, I think we’re only disagreeing about what constitutes a problem.Report

              • James K in reply to Vikram Bath says:

                I think you’re right. When I think of a problem for a regression, I think of things that stop OLS working as the best estimator, like endogeneity or non-stationarity.

                But multicollinearity is just like having a small sample size. It’s not great, but you can’t do anything about it and your techniques will still work as well as anything, so what can you do?Report

              • Michael Cain in reply to James K says:

                …so what can you do?

                Go back and think about the underlying problem some more? I always regard multicollinearity as a big red warning flag that I’m doing something badly. Why are these six variables (to choose a number at random) so highly correlated? Which one is fundamental and which ones are secondary? Are they all surrogates for something fundamental that I haven’t considered?Report

        • Vikram Bath in reply to James K says:

          Our current publishing standards do fall short of “reject anything that uses regression”.Report

  2. Chris says:

    I agree with almost everything you say here (except that multicollinearity is a statistical problem as well, in that it makes it difficult to get meaningful statistical results). I’ll just reiterate that I think sexiness for sexiness’ sake is a problem not because it gives the consumers of science what they want, but because it produces an incentive not just to study what they want to hear about, but to produce flashy results so that they’ll stand out. Sexiness isn’t a topic issue, it’s a results issue, and it is one made possible in large part by the focus on statistical significance in publishing.Report

  3. Tod Kelly says:

    “Science has propagated this error far and wide. Aiken and West (1991) now have 25,769 scholarly citations according to Google. That is breathtaking.”

    Is this the case though? And to be clear, I honestly don’t know.

    If mean-centering is simply a way to measure — à la nautical miles as opposed to land-lubber miles — then it only becomes a fallacy if the resulting number is treated as a special solution rather than a simple measurement. Otherwise, it’s like saying that an experiment that used ounces and cups rather than milliliters and liters is going to give you flawed results.

    Those near-26K citations, are they papers that rely on mean-centered measurements to add special and undeserved meaning to the results, or are they merely citing the measurement system they are using?Report

    • Chris in reply to Tod Kelly says:

      There are actual methodological and even theoretical reasons for centering (not necessarily, but often mean centering) variables. If you’re using a linear regression model, what the intercept means depends on how the variables are centered, so to the extent that the intercept is important, centering can be.

      However, centering to avoid multicollinearity issues is problematic because it doesn’t, as Vikram points out, change the fact that your two variables are highly correlated. You haven’t changed the predictive value of those variables just by subtracting a constant, unless there is some theoretical reason to believe that constant is important in the relationship between the variable(s) and whatever they’re supposed to be predicting.Report
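
(A quick numerical check of that point, using only NumPy on simulated data: subtracting the mean leaves covariance and correlation with another variable untouched.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, 1000)
y = 0.8 * x + rng.normal(0, 5, 1000)
x_centered = x - x.mean()

# Covariance and correlation with y are unchanged by subtracting a constant.
print(np.cov(x, y)[0, 1], np.cov(x_centered, y)[0, 1])
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_centered, y)[0, 1])
```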

      • Tod Kelly in reply to Chris says:

        If you have an experiment that uses centering, then, do others later verify it with centering as well, or do they just scrap it and pretend it doesn’t exist on the basis that it never should have been run in the first place?

        Again, I’m asking this as someone who doesn’t know.Report

        • Chris in reply to Tod Kelly says:

          It would depend on a lot of things, but presumably if my results are important enough to replicate, I’ve centered for a good reason, and anyone using similar methods to replicate my results would have the same reason.

          I’ll give an example: Imagine we have a model with two variables: a binary variable for gender in which 1=female and 0=male, and a continuous variable, height, measured in inches. If we don’t center height, then the intercept (which represents the predicted value with the other variables held at zero, so with gender=0) represents males who are 0 inches tall. This would make no sense, so we would center the variable somehow, probably on the average height.

          If I then got results showing that both gender and height influence pizza consumption in a certain way, then anyone who replicated my results would probably want to similarly center their height variable.Report
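
(A rough sketch of that height example in Python with statsmodels; all numbers, including the pizza model, are invented for illustration.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
female = rng.integers(0, 2, n)                 # 1 = female, 0 = male
height = rng.normal(66, 4, n)                  # inches
pizza = 10 - 2 * female + 0.1 * height + rng.normal(0, 1, n)

raw = sm.OLS(pizza, sm.add_constant(np.column_stack([female, height]))).fit()
ctr = sm.OLS(pizza, sm.add_constant(np.column_stack([female, height - height.mean()]))).fit()

# Slopes are identical; only the intercept changes meaning.
print(raw.params)   # intercept: predicted consumption for a male 0 inches tall
print(ctr.params)   # intercept: predicted consumption for a male of average height
```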

    • Vikram Bath in reply to Tod Kelly says:

      Tod,
      Someday, I will learn my own lesson that all analogies suck.

      Mean centering does not hurt. Like you said, you can do all your calculations in nautical miles, and it might even make sense to do that sometimes.

      It isn’t bad to mean center. It is just incorrect to think or claim that mean-centering alleviates any of your real-life problems. The SMJ paper I cited that said “We mean centered predictor variables in all the regression models to minimize multicollinearity” is flat out wrong about what mean-centering does, and the fact that they did it and it got published in a great journal just means it will be all the more likely that others will make the same mistake in the future. Someone will have a problem and think to look and see what these people did and then do that.

      Not all of the 26,000 citations are wrong. Some, like Echambadi and Hess (2004) cite Aiken and West (1991) specifically to counter what they say. Additionally, they did say other things, so some of those citations might be for those other things. In general though, I think most of the citations are inappropriate.Report

  4. James K says:

    I question the ubiquity of the problem you’ve identified, Vikram. For one thing, I’ve never heard of either the Variance Inflation Factor or the method used to “fix” it. This may be because anyone who understands how covariance is calculated should be able to tell immediately that it is mathematically impossible for adding a constant to your variable to have any effect on its covariances. The advice I was given for fixing multicollinearity was to not bother.

    As for reifying concepts, I can tell you in economics there are many instances where steps are taken to avoid that problem. For example, every economist is taught the ways that indicators like GDP and employment can increase but in ways that are bad for the economy or society. I see more reification of economic concepts outside of economics than inside it.Report

    • Vikram Bath in reply to James K says:

      James K: The advice I was given for fixing multicollinearity was to not bother.

      That’s not bad advice. “Collect more data” would be better though.

      James K: I see more reification of economic concepts outside of economics than inside it.

      I can agree with this…if for no other reason than how much day-to-day news the media puts out concerning the stock market. But even this is, I think, an example of people who spend more time with a metric being more likely to reify it.Report

      • James K in reply to Vikram Bath says:

        @vikram-bath

        I think the problem of reification is most acute with people who use something a lot, but lack a deep understanding of it. I know exactly how GDP is calculated, so I have a good handle on what it can and can’t do. People who don’t have that knowledge will be more likely to think of it as a black box, rather than a collection of parts.Report

    • Chris in reply to James K says:

      Virtually any statistical package will output VIFs, condition indices, and/or tolerance (1 − R²) as standard measures of collinearity.Report
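
(For example, those diagnostics can be computed by hand with statsmodels’ variance_inflation_factor; the data below are simulated.)

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)      # strongly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each column of the design matrix; tolerance is its reciprocal (1 - R^2).
for i, name in enumerate(["const", "x1", "x2"]):
    vif = variance_inflation_factor(X, i)
    print("%-5s VIF = %8.1f   tolerance = %.3f" % (name, vif, 1.0 / vif))
```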

    • Troublesome Frog in reply to James K says:

      On the economist front, I agree. I’ve seen a lot of people make the mistake referred to in that link, but I’ve never seen an economist make it. I’ve seen that mistake ascribed to economists who aren’t really making it, though.Report

    • Vikram Bath in reply to James K says:

      James K: This may be because anyone who understands how covariance is calculated should be able to tell immediately that it is mathematically impossible for adding a constant to your variable to have any effect on its covariances.

      Well, that’s why I lost my temper and called my wife stupid. I share your sense of “obviousness” about the whole matter. Yet Aiken and West got published by Sage, which people (including me) still regularly use as a go-to resource for their empirical problems. And since then thousands of researchers, many of whom are much smarter, better published, and probably better looking than I am, have been duped by an idea whose flaw seems obvious without much reflection.Report

      And as I mentioned, the worst bit is that it doesn’t seem correctable.Report

      • Chris in reply to Vikram Bath says:

        This is true if all you’re doing is dealing with two independent variables, but the mean-centering is used for interaction terms, of which the mean-centered variable will be a constituent. It’s pretty easy to see, mathematically, how this does affect covariance. That’s not the issue. If it was, no one would use this method, because these are not stupid people.Report

        • Vikram Bath in reply to Chris says:

          Sorry, yes, the interaction covariances change. When it comes to reducing multicollinearity (or even changing the point estimates of the regression), these changes are simply offset elsewhere. From Echambadi and Hess (where “x1x2” is an interaction term):

          Mean-centering not only reduces the covariance between x1 and x1x2, which is “good,” but it also reduces the variance of the exogenous variable x1x2, which is “bad.” For accurate measurement of the slope of the relationship, we need the exogenous variables to sweep out a large set of values; however, mean-centered (x1 − x̄1)(x2 − x̄2) has a smaller spread than x1x2. When both the improvement in collinearity and the deterioration of exogenous variable spread are considered, mean-centering provides no change in the accuracy with which the regression coefficients are estimated. The complete analysis of mean-centering shows that mean-centering neither helps nor hurts moderated regression.

          Report
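
(A numerical check of the point in that quote, on simulated data: mean-centering x1 and x2 reparameterizes the moderated regression but leaves the interaction estimate and its standard error unchanged. This sketch is illustrative, not Echambadi and Hess’s own analysis.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(5, 1, n)
x2 = rng.normal(3, 1, n)
y = 1 + 0.5 * x1 + 0.7 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, n)

def moderated(a, b):
    X = sm.add_constant(np.column_stack([a, b, a * b]))
    return sm.OLS(y, X).fit()

raw = moderated(x1, x2)
centered = moderated(x1 - x1.mean(), x2 - x2.mean())

# The interaction estimate and its standard error come out the same either way.
print("raw:      b3 = %.4f  se = %.4f" % (raw.params[-1], raw.bse[-1]))
print("centered: b3 = %.4f  se = %.4f" % (centered.params[-1], centered.bse[-1]))
```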

        • James K in reply to Chris says:

          @chris

          Why would you worry about multicollinearity between a variable and its interaction terms? In that case, the interpretability problem can be straightforwardly solved by reporting partial derivatives instead of coefficients.Report
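
(One possible reading of James K’s suggestion, sketched on simulated data with invented names: report dy/dx1 = b1 + b3·x2 at values of x2 that matter, rather than b1 alone.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=(2, 400))
y = 0.5 * x1 + 0.7 * x2 + 0.3 * x1 * x2 + rng.normal(size=400)

res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x1 * x2]))).fit()
b0, b1, b2, b3 = res.params

# The effect of x1 depends on x2, so report dy/dx1 = b1 + b3*x2 at values of
# interest instead of quoting b1 by itself.
for v in (x2.mean() - x2.std(), x2.mean(), x2.mean() + x2.std()):
    print("dy/dx1 at x2 = %+.2f : %.3f" % (v, b1 + b3 * v))
```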

          • Chris in reply to James K says:

            I admit I don’t know how that will help. Could you flesh it out?

            I’ve run into collinearity issues in my own modeling work mostly when I have sample size issues. Say I am looking at location and IQ, for example, and using dummies for each of several locations. I only have a few people from Capitol Hill, and they’re all idiots, so I can’t get any meaningful coefficients for Capitol Hill. But tossing the location is dicey for methodological reasons, if nothing else (it’s worse if Capitol Hill is the location I care about, obviously).

            This is less a statistical problem than a data collection problem, but if I’m using regression, it’s likely I had nothing to do with the data collection, and I’m stuck with what they give me.Report

  5. You’ve ruled it down to either being parental IQ inherited by children or parents reading books to their kids more

    And you left out the parents’ having the contacts to get their kids into good prep schools and colleges and find them good jobs because …?Report

    • It probably works against my political worldview!

      Since it was just an example, I just wanted it to only have two independent variables.Report

      • Brandon Berg in reply to Vikram Bath says:

        Not sure of the specific study, if any, you had in mind, but doesn’t the effect go much further down the SES scale than is consistent with that being the primary explanation?Report

        • I didn’t have any particular study in mind.

          I think it depends on what Mike considers “contacts”. Knowing anyone who works in the professions could be considered a “contact”. Though I agree that, the way he put it, it sounds like knowing people who can get someone a job or a college admission just at their suggestion.Report

          • Or at least materially affect their kids’ chances of success by knowing who to talk to.

            Though, at a lower SES level, having been through the admissions process yourself and having friends whose kids went through it recently help as well.Report

  6. Michael Cain says:

    I’m impressed that there are 40 comments (mostly) on the subject of multicollinearity.Report

  7. Kim says:

    What worries me is not what gets published, but what doesn’t.
    So much research that you don’t even hear about.Report