A couple of weeks ago I wrote to the list seeking input on how best to
calculate a correlation coefficient when you are controlling for
something: in effect, what kind of partial correlation should you
use? See
http://sports.groups.yahoo.com/group/sportscience/message/2614 for
the original message.
Frank Katch replied with contact info on two statisticians he thought
could help, Dave Hosmer and Stan Lemeshow. I contacted them and
both very kindly replied, although they didn't resolve the issue for
me. Ian Shrier also sought input from one of his senior
epidemiologist colleagues (unnamed), and I had some valuable interactions
with Ian as well. Their replies appear below.
To revisit the question, consider this example. You are interested
in the effect of physical activity on health. You do a
cross-sectional study in which you measure health, physical activity, and
various other things that you know you ought to measure, because they
might also predict health, and anyway, other people measure them so you'd
better, too. In particular, you measure socioeconomic status (SES)
and find that SES and physical activity are both positively correlated
with health. Further, you find quite a strong correlation between
SES and activity. (A substantial correlation between predictor
variables is called substantial collinearity, by the way.)
Now, people of high SES eat good food, live in toxin-free classy parts of
town, read Time, and think they're alpha in every way. All these
things could account for their good health. Oh, and they do a lot
of good-quality deliberate exercise, but that might have nothing to do
with their good health. It's all those other things that go with
high SES. How do you analyze your data to address this potential
for the effect of activity on health to be "confounded" by
SES? By doing a multiple linear regression, of course. The
effect of activity in a multiple linear regression that includes SES is
the effect of activity "controlled for" SES; that is, the
effect with SES effectively held constant. But how do you express
the magnitude of the resulting effect of activity on health? That
was the substance of my query to the list.
I finally answered this question to my own satisfaction by doing some
simulations. I generated two predictor variables, X1 and X2, that
were like SES and activity: correlated with each other to some
extent that I could change, and correlated with a dependent variable Y,
like health, to some extent that I could also change. I threw in an
additional predictor X3 that was also correlated with Y but uncorrelated
with X1 and X2, just to keep track of how that kind of variable behaved
in such an analysis. I won't say any more about that one, other
than that it gave the right correlations in the multiple linear
regression.
You have two choices for interpreting the magnitude. You use either
the regression coefficients (the terms in the multiple linear regression
that convert values of the predictors into values of the dependent
variable) or correlation coefficients. It's hard to get a good idea
of magnitude from the regression coefficients without invoking Cohen's
concepts in some manner. In other words, the between-subject
standard deviation (variation) in the predictor and dependent have to
come into the story. Correlation coefficients already have
between-subject SDs built in, so they are good candidates for
interpreting magnitude. The correlation coefficient for a given
predictor in a multiple linear regression is called a partial correlation
coefficient, but you can calculate it in several ways. More about
that in a moment.
My simulations showed that the regression coefficients are bad measures
for gauging magnitude when there is substantial correlation between X1
and X2. I pushed things to the limit to get a clear picture by
making X1 and X2 exactly the same, apart from random
noise. The more noise I added, the worse the
correlation between the two. Because X1 and X2 were effectively the
same, they both had the same correlation with Y in reality, but of
course, in any sample their correlations with Y would differ a little
because of the noise. I found that the regression coefficients for
X1 and X2 on average were identical in the multiple linear regression, as
they would have to be. Further, they were half the value that
either had in the multiple regression when the other was not included in
the model. That makes sense too. But note that the value with
both in the model is not zero. It is half what either one has on
its own. Interpreting the regression coefficient for, say, X1
controlling for X2 would therefore be misleading, because it would give
you the impression that X1 had half its effect in the presence of
X2. But X1 and X2 are measuring the same thing, apart from noise,
so whatever measure of magnitude you use for X1 in the presence of X2, it
should be zero, not half the value when it's on its own. There is
also an issue about precision of the estimate of the regression
coefficients when you have strong collinearity, but that's not an issue
here. Others will disagree.
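For anyone who wants to try this, here is a minimal sketch of that kind of simulation in Python with NumPy. It is not the code I used; the variable names, the sample size, the noise level and the way Y is built (a shared variable plus random error) are all assumptions made for illustration.

```python
import numpy as np

def simulate(n=100_000, noise_sd=0.1, seed=0):
    """Return x1, x2, y, where x1 and x2 measure the same thing plus noise."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal(n)                  # what both predictors measure (assumed)
    x1 = common + noise_sd * rng.standard_normal(n)
    x2 = common + noise_sd * rng.standard_normal(n)
    y = common + rng.standard_normal(n)              # dependent variable (assumed form)
    return x1, x2, y

def slopes(y, *predictors):
    """Least-squares regression coefficients for the predictors (intercept dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

x1, x2, y = simulate()
b_alone = slopes(y, x1)[0]        # X1 on its own
b_x1, b_x2 = slopes(y, x1, x2)    # X1 and X2 together in the model
print(f"X1 alone: {b_alone:.2f}   X1 with X2: {b_x1:.2f}   X2 with X1: {b_x2:.2f}")
```

With little noise, the coefficient for X1 on its own should come out at roughly twice the value it has when X2 is also in the model, and the two coefficients in the joint model should be roughly equal.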
The partial correlation coefficients did much better, although they
weren't perfect. First, it was obvious I should use what SAS calls
a Type II (or simultaneous) partial, which means the partial correlation
when you have controlled for any confounding effects of the other
predictors. When I had little noise in the values of X1 and X2, and
therefore a high correlation between them, the partial between Y and X1
with X2 in the model was near enough to zero, and vice versa, which is
the right answer. Partial correlations beat regression
coefficients.
When there was more noise in the relationship between X1 and X2, and
therefore a lower correlation between them, the partial correlation for
one of them in the presence of the other started to creep up from
zero. The poorer the correlation between X1 and X2, the bigger the
partial for any one of them. This occurred even though X1 and X2
were measuring exactly the same thing in Y--I made sure of that in the
way I generated the variables. In other words, noise in a predictor
reduces your ability to fully control for its confounding effects in a
multiple linear regression. I knew all that from way back, but it
was good to be reminded. There is an important consequence for
interpretation of studies of population health: be skeptical about
reports that state things like "even after we controlled for SES,
diet etc etc, there was still an effect of activity on
health". It may be that there isn't actually any substantial
effect of activity on health in reality, once you control for those other
things, because lifestyle predictors are often noisy, so they don't
control properly.
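The effect of noise on the partial can be sketched as follows. This is an illustration under assumed settings, not my original simulation: X1 and X2 are the same underlying variable plus independent noise, Y is that variable plus random error, and the partial comes from the standard three-variable formula rather than from a stats package.

```python
import numpy as np

def partial_r(r_y1, r_y2, r_12):
    """Partial correlation of Y with X1, controlling for X2."""
    return (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))

def simulate_partial(noise_sd, n=200_000, seed=1):
    """Return (correlation of X1 with X2, partial of Y with X1 given X2)."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal(n)                  # what both predictors measure (assumed)
    x1 = common + noise_sd * rng.standard_normal(n)
    x2 = common + noise_sd * rng.standard_normal(n)
    y = common + rng.standard_normal(n)              # dependent variable (assumed form)
    r = np.corrcoef([y, x1, x2])                     # rows are y, x1, x2
    return r[1, 2], partial_r(r[0, 1], r[0, 2], r[1, 2])

# Noise levels are arbitrary choices for illustration.
results = {sd: simulate_partial(sd) for sd in (0.1, 0.5, 1.0)}
for sd, (r_12, r_partial) in results.items():
    print(f"noise SD {sd}: r(X1,X2) = {r_12:.2f}, partial r(Y,X1 | X2) = {r_partial:.2f}")
```

Under these assumed variances the partial should climb from under 0.1 at the lowest noise level to about 0.3 at the highest, even though X1 carries no information about Y beyond what the noise-free version of X2 would carry.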
But my original question, about which partial correlation to use, is
still unanswered! You can calculate the correlation as a
semi-partial, which means effectively that the magnitude of the effect is
interpreted using the SD of the original dependent variable. Or you
can calculate it as a (full) partial, which means you effectively use the
SD of the residual variation in the dependent variable after all the
other predictors have been taken into account. The partial will be
larger than the semi-partial, because the partial is the expected
correlation for subjects all with the same SES, diet etc etc. Is it
the right one to use? Dunno. It looks bigger, so if you want
a big correlation, use the partial, not the semi-partial? But if
you want your answer to be little or no effect, use the semi-partial, not
the partial? That's not good science.
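To make the difference between the two concrete, here is a hedged sketch that computes both from residuals. The data are simulated under assumed settings (the names, noise levels and sample size are mine, for illustration only). In both cases X1 is first stripped of whatever X2 predicts; the semi-partial then correlates that with the original Y, whereas the partial correlates it with Y after X2 has been taken out of Y as well.

```python
import numpy as np

def residuals(target, control):
    """Residuals of target after a straight-line fit on control."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# Simulated data; all settings are assumptions for illustration.
rng = np.random.default_rng(2)
n = 100_000
common = rng.standard_normal(n)
x1 = common + rng.standard_normal(n)
x2 = common + rng.standard_normal(n)
y = common + rng.standard_normal(n)

x1_res = residuals(x1, x2)   # X1 with X2 removed
y_res = residuals(y, x2)     # Y with X2 removed

semi_partial = np.corrcoef(y, x1_res)[0, 1]    # uses the SD of the original Y
partial = np.corrcoef(y_res, x1_res)[0, 1]     # uses the SD of the residual Y
r_y2 = np.corrcoef(y, x2)[0, 1]

print(f"semi-partial = {semi_partial:.3f}, partial = {partial:.3f}")
print(f"semi-partial / sqrt(1 - r_y2^2) = {semi_partial / np.sqrt(1 - r_y2**2):.3f}")
```

With one control variable the two differ only by the factor sqrt(1 - r^2), where r is the correlation between Y and the control, which is why the partial can never be the smaller of the two in absolute value.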
A further important point is what I call the danger of throwing out the
baby (activity) with the bathwater (SES and the other predictors in the
model). In the example, controlling for SES left nothing for
activity. But that doesn't mean activity isn't important. It
just means that its effects go along with SES. So make sure you
look at the correlation between the dependent (health) and each predictor
(activity, SES...) in its own--the simple raw correlations, in other
words. If you find the correlation for activity is substantial on
its own but drops to near zero when you control for the other variables,
your conclusion is that activity could still be important, even though it
is accounted for by the other predictors. And if you find that it
is still important after you control for the other predictors, and some
of them are noisy, make sure you alert the reader to the possibility that
the controlling might not be that good.
The person for whom I initiated this enquiry also wanted confidence
limits for the partial correlation. Hmmm... Stats
packages probably won't give you that, but they will give you a p value
for the predictor controlled for all the other predictors. To
convert that p value and correlation coefficient into confidence limits,
convert the correlation to a Fisher z using the FISHER() function in
Excel. That statistic has a normal sampling distribution, so put it
and the p value into my spreadsheet for confidence limits, the bit at the
top that deals with normally distributed variables. Make the degrees of
freedom something large, like 1000. Then convert the confidence
limits back into correlation coefficients using the FISHERINV() function
in Excel. Voila. You can do this for either the semi-partial (part) or
the partial correlations. Naturally you then interpret the confidence
limits in relation to substantial values. Or use the chances of
benefit and harm in the confidence-limits spreadsheet. The Fisher z
transform of the smallest worthwhile correlation is effectively the same
as the smallest correlation, 0.1.
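The same steps can be sketched outside Excel. This is only my reading of the procedure in code, not the spreadsheet itself: the function name is hypothetical, it uses the normal distribution directly instead of a t distribution with large degrees of freedom, and it assumes the p value is two-sided, is less than 1, and belongs to the correlation you supply.

```python
from math import atanh, tanh
from statistics import NormalDist

def correlation_confidence_limits(r, p, level=0.95):
    """Confidence limits for a correlation r, given its two-sided p value.

    Hypothetical helper: assumes 0 < p < 1 and r != 0.
    """
    normal = NormalDist()
    z = atanh(r)                               # Fisher z, same as Excel's FISHER()
    z_stat = normal.inv_cdf(1 - p / 2)         # normal deviate implied by the p value
    se = abs(z) / z_stat                       # implied standard error of the Fisher z
    crit = normal.inv_cdf(0.5 + level / 2)     # e.g. about 1.96 for 95% limits
    # Back-transform, same as Excel's FISHERINV()
    return tanh(z - crit * se), tanh(z + crit * se)

lower, upper = correlation_confidence_limits(0.30, 0.05)
print(f"95% confidence limits: {lower:.2f} to {upper:.2f}")
```

As a check on the logic, a correlation with a p value of exactly 0.05 gets a lower 95% limit of zero, which is what it should be.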
Will
Here are the replies I got, edited a little. I disagree with most
of the points in these replies, but I am not really certain about this
stuff.
From: Dave Hosmer
In your example there is probably an interaction (effect modification) so
adjustment (confounding) is not relevant. You can see Hosmer and
Lemeshow, Applied Logistic Regression, Second Edition for a discussion of
this in the context of a logistic regression.
As for coefficients, the model must be fit on data that has been
standardized ((data - mean)/SD) to be able to compare coefficients.
The regression coefficient provides an estimate of effect holding all
other model covariates constant, unless there is an interaction.
Personally I hate correlations of any type and use coefficient based
estimates of effect.
From: Stanley Lemeshow
Perhaps these few pages of overheads I use in my class will be helpful to
you. The program I used in the notes is Stata. A good
reference is the book Applied Regression Analysis and Multivariable
Methods - 3rd Edition by Kleinbaum, Kupper, Muller and Nizam.
[From what Stan had in his overheads, it looks like he uses the partial,
not the semi-partial. He gave no rationale, however.]
From: Ian Shrier <ian.shrier@...>
Okay, I have an answer from a senior epidemiologist who is very well
respected at our institution. I will later get a statistician's
viewpoint.
In epidemiology, we focus on the coefficients and the confidence
intervals for the coefficients. The fitting of the overall model is
important, but not the partial correlation coefficients.
From an epidemiological perspective, correlation coefficients are not
useful. First, in a regression model, X predicts Y. Correlation
coefficients are based on the premise that there is a correlation between
the two but not that one predicts the other. Second, the concept that
correlation coefficients explain the variance is only internal. As
samples vary in study to study, the "explaining" would vary.
More importantly, the total variance would vary from sample to sample and
therefore the amount of explaining is not very helpful. In other words,
the correlation coefficients are not transferable from sample to sample
but are helpful in model selection because that is internal to the one
study.
I will try to see if the statisticians agree.
[later]...The statistician I spoke to deals mostly with model selection.
And he is a Bayesian, so he uses Bayesian techniques. He did say that the
partial correlation coefficients will probably only be meaningful if
every variable is normalized so that the variances are normalized, if I
understood him correctly... ...Conceptually, I agree with the
epidemiologist approach. Look at the coefficients and confidence
intervals. Which variables belong in the model (i.e. model selection)
must be much more than numbers and include causal pathways and an
understanding of mechanisms. Overall model fitting is important, but your
way of looking at partial correlations does not seem to help me much.