Date: Mon, 12 Sep 2005 17:15:03 -0300
Reply-To: Hector Maletta <firstname.lastname@example.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Hector Maletta <email@example.com>
Subject: Re: data transformation bibliografical sources
Content-Type: text/plain; charset="us-ascii"
I have written a short commentary below, on the relevant part of your
> I need to disagree with two comments from Hector. For much of
> social science there is no a priori meaningful scale so often
> transformed variables (if they are increasing
> transformations) may have as much or more legitimacy as the
I agree. No need to disagree with me on this. I do not understand what an
"increasing transformation" is, but anyway.
>This is particularly true with income. How could
> Jorge have the same error in his calculated income as Bill
> Gates does in his income?
This is particularly true with income, but not for that reason. The reason
is that the economic behavior of people is strongly related to proportional
change in their income (or costs of living, or consumption, or investment,
or profits) and not so much to the absolute amount of change in those
variables (although the absolute amount has some real effects too).
It is true that in estimating individual incomes one is bound to make bigger
absolute errors with Gates's income than with Jorge's (which we assume is
slightly more meager than Bill's). But this would mean that regression methods
cannot be used to estimate absolute levels of income, because regression
assumes homoskedasticity (the same error variance at all levels of income).
Fortunately, this is not a great trouble, because most economic theories
about these issues (consumer, investor and producer behavior) are based on
PROPORTIONAL CHANGE, and not on ABSOLUTE CHANGE. For this reason, and for
this reason alone, logarithms should be used. Now, using logarithms has the
fortunate consequence of making errors in the logarithm not so different
at different levels of income. In particular, the same proportional change
translates into the same logarithmic difference, regardless of the scale of
income. Notice that this does not make absolute changes or errors smaller.
They are the same as before, but between 10,000 and 12,000 there is the same
logarithmic difference as between 1,000,000 and 1,200,000, and so these two
different differences (respectively $2,000
and $200,000) represent the same proportional difference of 20% in both
If we lived in a world where people reacted NOT to proportional but to
absolute differences in income, we should not be guided by equations based
on the log of income, and if that precludes the use of regression, so be it.
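To make the arithmetic of the 10,000 vs 12,000 example concrete, here is a
small Python sketch. The function name log_difference is only an illustrative
helper, and natural logs are an assumption (any base gives the same equality
between the two cases).

```python
import math

def log_difference(old, new):
    """Difference of natural logs, which equals log(new / old)."""
    return math.log(new) - math.log(old)

# The two income changes discussed above: same 20% change, different scale.
pairs = [(10_000, 12_000), (1_000_000, 1_200_000)]

for old, new in pairs:
    absolute = new - old
    proportional = (new - old) / old
    print(f"{old:>9,} -> {new:>9,}  absolute={absolute:>7,}  "
          f"proportional={proportional:.0%}  "
          f"log diff={log_difference(old, new):.4f}")
```

Both rows show the same log difference, log(1.2), while the absolute
differences stay at 2,000 and 200,000.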
> Errors and misestimates are obviously related to size, ergo
> the necessity of logging.
NON SEQUITUR. If you log for the pleasure of logging, your log errors would
be homogeneous along the IV, but that would not make your absolute error any
> Second, there isn't always the possibility of finding an
> abstruse mathematical formula (unless it's stochastic) to
> create normality. I have had students (albeit without much
> background in math) try to transform gender (M or F) into a
> normally distributed and symmetric variable. Square roots and
> logarithms didn't work! Neither did anything else.
Again, here the old hydra of confusion rears its head: the variable itself,
like gender or income, may well not have a normal distribution, but the
error in estimating it should be normal for parametric methods to be
applied. If you draw repeated samples of 500 people from a large population,
and measure the proportion of women in each, these proportions will vary from
one sample to the next, and the various sample estimates will have (usually,
if samples are random) an approximately normal distribution around the true
proportion of women in the population. The approximation improves as the
sample gets larger.
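The repeated-sampling argument can be checked with a short simulation. This
is only a sketch under stated assumptions: a true proportion of women of 0.5,
2,000 repeated samples, and a fixed random seed are all made up for
illustration, and sample_proportions is a hypothetical helper name.

```python
import random
import statistics

def sample_proportions(true_p, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the proportion of women in each."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        women = sum(rng.random() < true_p for _ in range(sample_size))
        proportions.append(women / sample_size)
    return proportions

TRUE_P = 0.5        # assumed true proportion of women in the population
SAMPLE_SIZE = 500   # sample size used in the text

props = sample_proportions(TRUE_P, SAMPLE_SIZE, n_samples=2000)

mean_p = statistics.mean(props)
sd_p = statistics.stdev(props)
# Standard error predicted by the normal approximation: sqrt(p(1-p)/n).
theoretical_sd = (TRUE_P * (1 - TRUE_P) / SAMPLE_SIZE) ** 0.5
# Share of sample estimates within two standard errors of the true value.
within_2sd = sum(abs(p - TRUE_P) <= 2 * theoretical_sd for p in props) / len(props)

print(f"mean of sample proportions: {mean_p:.4f}")
print(f"observed sd: {sd_p:.4f}   theoretical sd: {theoretical_sd:.4f}")
print(f"share within 2 sd of the true proportion: {within_2sd:.3f}")
```

The individual observations are all 0 or 1 (woman or not), yet the sample
proportions cluster symmetrically around the true value, with roughly 95% of
them within two standard errors.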
As for abstruse formulas: suppose your linear regression yields
heteroskedastic errors. It is a well known fact that N points (with distinct
X values) can be exactly predicted by a polynomial of degree N-1 (two points
by a straight line, three by a quadratic equation, four by a cubic equation,
and so on). So if you use a polynomial of degree N-1 you will have a
regression curve that passes exactly through all your points, with zero error
of estimation for
all cases. More precisely, if you predict Y by an N-1 polynomial in X, the
predicted and observed values will be the same for all the N points, with
R2=1 and standard error of estimate =0, with an infinitely narrow confidence
interval for your sample. If you assume, as is assumed in regression
analysis, that the independent variables are not random variables, then this
would also be true for ANY sample, not just for yours (but do not run to
your computer: all variables, even IV, are in fact random variables which
can vary from one sample to the next, and so much for regression's
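
The N-1 polynomial claim can be verified directly. The sketch below uses
Lagrange interpolation as one way of building that polynomial (not
necessarily what a regression package would do internally); the five data
points are made up for illustration and lagrange_predict is a hypothetical
helper name.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs)-1 through (xs, ys).

    Requires all values in xs to be distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

# Five made-up observations: N = 5, so the polynomial has degree 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.3, 1.1, 4.8, 3.9, 7.2]

predicted = [lagrange_predict(xs, ys, x) for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

print("predicted:", predicted)
print("residuals:", residuals)
```

Every residual is zero for these N points, which is the R2=1 situation
described above. The fit says nothing about how the curve behaves at X values
that were not in the sample.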
Hope this helps, or at least keeps the discussion sparkling.