LISTSERV at the University of Georgia
Date:         Mon, 12 Sep 2005 17:15:03 -0300
Reply-To:     Hector Maletta <hmaletta@fibertel.com.ar>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Hector Maletta <hmaletta@fibertel.com.ar>
Subject:      Re: data transformation bibliografical sources
Comments: To: Derek Wilkinson <dwilkinson@laurentian.ca>
In-Reply-To:  <200509121859.j8CIxkWj032001@listserv.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"

Derek, I have written a short commentary below on the relevant parts of your message. You wrote:

> I need to disagree with two comments from Hector. For much of social science there is no a priori meaningful scale so often transformed variables (if they are increasing transformations) may have as much or more legitimacy as the original.

I agree. No need to disagree with me on this. I do not understand what an "increasing transformation" is, but anyway.

> This is particularly true with income. How could Jorge have the same error in his calculated income as Bill Gates does in his income?

This is particularly true with income, but not for that reason. The reason is that people's economic behavior is strongly related to proportional change in their income (or cost of living, or consumption, or investment, or profits) and not so much to the absolute amount of change in those variables (although the absolute amount has some real effects too). It is true that in estimating individual incomes one is bound to make bigger absolute errors with Gates's income than with Jorge's (which we assume is slightly more meager than Bill's). But this would mean that regression methods cannot be used to estimate absolute levels of income, because regression assumes homoskedasticity (the same variance of the estimation error at all levels of income). Fortunately, this is not a great problem, because most economic theories about these issues (consumer, investor and producer behavior) are based on PROPORTIONAL CHANGE, and not on ABSOLUTE CHANGE. For this reason, and for this reason alone, logarithms should be used.

Now, using logarithms has the fortunate consequence of making errors in the logarithm not so different at different levels of income. In particular, the same proportional change translates into the same logarithmic difference, regardless of the scale of income. Notice that this does not make absolute changes or errors smaller. They are the same as before, but between 10,000 and 12,000 there is the same logarithmic difference as between 1,000,000 and 1,200,000, and so these two different differences (respectively $2,000 and $200,000) represent the same proportional difference of 20% in both cases. If we lived in a world where people reacted NOT to proportional but to absolute differences in income, we should not be guided by equations based on the log of income, and if that precluded the use of regression, so be it.
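The arithmetic of the 20% example can be checked in a few lines. This is only an illustrative sketch in Python; the helper name log_difference is mine, and natural logarithms are assumed (any base gives the same conclusion).

```python
import math

def log_difference(before, after):
    """Difference in natural logs, which equals log(after / before)."""
    return math.log(after) - math.log(before)

# The two cases from the text: the same 20% change at very different scales.
small = log_difference(10_000, 12_000)
large = log_difference(1_000_000, 1_200_000)

# The absolute changes differ by a factor of 100 ...
print("absolute change:", 12_000 - 10_000, "vs", 1_200_000 - 1_000_000)
# ... but the logarithmic differences agree, since both equal log(1.2).
print(f"log difference:  {small:.4f} vs {large:.4f}")
```

Both log differences equal log(1.2), while the absolute differences remain $2,000 and $200,000.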

> Errors and misestimates are obviously related to size, ergo the necessity of logging.

NON SEQUITUR. If you log for the pleasure of logging, your log errors would be homogeneous across the range of the IV, but that would not make your absolute errors any more homogeneous.
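A small simulation may make the distinction concrete. It is a sketch under an assumption that is mine, not Derek's or anything in the data: that the measurement error is multiplicative, i.e. observed income = true income * exp(e), with e normal and of the same spread (sigma = 0.1, an arbitrary value) at every income level. The function name error_spreads is likewise hypothetical.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

def error_spreads(true_income, sigma=0.1, n=5000):
    """Return (sd of absolute errors, sd of log errors) for one income level.

    Assumes a multiplicative error: observed = true * exp(e), e ~ N(0, sigma).
    """
    observed = [true_income * math.exp(random.gauss(0.0, sigma)) for _ in range(n)]
    abs_errors = [y - true_income for y in observed]
    log_errors = [math.log(y) - math.log(true_income) for y in observed]
    return statistics.pstdev(abs_errors), statistics.pstdev(log_errors)

low_abs, low_log = error_spreads(10_000)
high_abs, high_log = error_spreads(1_000_000)

print(f"sd of log errors:      {low_log:.3f} vs {high_log:.3f}")
print(f"sd of absolute errors: {low_abs:,.0f} vs {high_abs:,.0f}")
```

Under that assumption the log errors have about the same spread at both income levels, while the absolute errors at the higher income remain roughly a hundred times larger. Logging changes the scale on which the errors are homogeneous; it does not shrink the absolute errors.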

> Second, there isn't always the possibility of finding an abstruse mathematical formula (unless it's stochastic) to create normality. I have had students (albeit without much background in math) try to transform gender (M or F) into a normally distributed and symmetric variable. Square roots and logarithms didn't work! Neither did anything else.

Again, the old hydra of confusion rears its head here: the variable itself, like gender or income, may not possibly have a normal distribution, but its estimation error should be normal for parametric methods to be applied. If you draw repeated samples of 500 people from a large population and measure the proportion of women, these proportions will vary from one sample to the next, and the sample estimates will (usually, if the samples are random) have an approximately normal distribution around the true proportion of women in the population. The larger the sample, the closer this approximation.
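The repeated-sampling argument can be sketched as a simulation. The true proportion of women (0.51) and the number of repeated samples (2,000) are hypothetical values chosen for illustration; only the sample size of 500 comes from the text.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

P_WOMEN = 0.51      # hypothetical true proportion in the population
SAMPLE_SIZE = 500   # sample size used in the text
N_SAMPLES = 2000    # hypothetical number of repeated samples

def sample_proportion(p, n):
    """Proportion of women in one random sample of n people."""
    return sum(random.random() < p for _ in range(n)) / n

proportions = [sample_proportion(P_WOMEN, SAMPLE_SIZE) for _ in range(N_SAMPLES)]

mean_p = statistics.mean(proportions)
sd_p = statistics.pstdev(proportions)
# Standard error predicted by the normal approximation to the binomial.
theoretical_sd = math.sqrt(P_WOMEN * (1 - P_WOMEN) / SAMPLE_SIZE)
# Share of sample estimates within 1.96 standard errors of the true proportion.
share_within = sum(abs(p - P_WOMEN) <= 1.96 * theoretical_sd for p in proportions) / N_SAMPLES

print(f"mean of sample proportions: {mean_p:.4f}")
print(f"sd of sample proportions:   {sd_p:.4f} (theory: {theoretical_sd:.4f})")
print(f"share within 1.96 se:       {share_within:.3f}")
```

Gender itself takes only two values in every sample, yet the sample proportions cluster around the true value with roughly the spread the normal approximation predicts, and about 95% of them fall within 1.96 standard errors.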

As for abstruse formulas: suppose your linear regression yields heteroskedastic errors. It is a well-known fact that N points (with distinct X values) can be exactly predicted by a polynomial of degree N-1 (two points by a straight line, three by a quadratic equation, four by a cubic equation, and so on). So if you use a polynomial transformation of degree N-1 you will have a regression curve that passes exactly through all your points, with zero error of estimation for all cases. More precisely, if you predict Y by a polynomial of degree N-1 in X, the predicted and observed values will be the same for all the N points, with R2 = 1 and standard error of estimate = 0, with an infinitely narrow confidence interval for your sample. If you assume, as is assumed in regression analysis, that the independent variables are not random variables, then this would also be true for ANY sample, not just for yours (but do not run to your computer: all variables, even the IVs, are in fact random variables which can vary from one sample to the next, and so much for regression's assumptions).
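The exact-fit claim can be checked with a short sketch. It uses Lagrange interpolation rather than a regression routine, which yields the same unique polynomial of degree N-1 when the X values are distinct. The five data points are made up for illustration, and exact fractions are used so that the zero residuals are not blurred by rounding.

```python
from fractions import Fraction

def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs)-1 through all points.

    Requires the values in xs to be distinct.
    """
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# Five made-up observations (N = 5), so the polynomial has degree 4.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 1, 5]

residuals = [y - lagrange_predict(xs, ys, x) for x, y in zip(xs, ys)]

mean_y = Fraction(sum(ys), len(ys))
sse = sum(r * r for r in residuals)           # sum of squared errors
sst = sum((y - mean_y) ** 2 for y in ys)      # total sum of squares
r_squared = 1 - sse / sst

print("residuals:", [str(r) for r in residuals])
print("R2 =", r_squared)
```

Every residual is exactly zero and R2 is exactly 1 for these five points, whatever Y values are chosen. The fit says nothing about how the same polynomial would perform on a new sample.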

Hope this helps, or at least keeps the discussion sparkling.

Hector

