LISTSERV at the University of Georgia
Date:         Mon, 3 Jul 2006 04:14:32 +0000
Reply-To:     toby dunn <tobydunn@HOTMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         toby dunn <tobydunn@HOTMAIL.COM>
Subject:      Re: Transformations of correlated X variables
Comments: To: art297@NETSCAPE.NET
In-Reply-To:  <200607030331.k62Akn6a019160@malibu.cc.uga.edu>
Content-Type: text/plain; format=flowed

I remember, from way back when, a paper that discussed a multiple-misspecification test. It was grounded in the fact that there are many reasons why one misspecifies a model, and they can be interrelated. So the author came up with a way to more or less test a bunch of things at the same time. Now if I could just remember what that paper was called.

Toby Dunn

From: Arthur Tabachneck <art297@NETSCAPE.NET>
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Transformations of correlated X variables
Date: Sun, 2 Jul 2006 23:31:43 -0400

Paul,

I'll leave the discussion of the merits of your hypothesis to the list's stat experts.

However, I do recall being taught that one of the assumptions which must be met in using linear regression methods is that the residuals must be normally distributed and, if one has a reason to believe that a particular variable inherently has a non-normal distribution (e.g., one that a log or square root transformation would normalize), then a normal distribution of the errors might be obtained by transforming the data accordingly.

With insurance data I find that to be the case when modelling such variables as claim frequency and severity.
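To illustrate the point about transformations and residual normality, here is a minimal sketch in Python (the thread itself contains no code). It assumes a hypothetical "severity" variable that is lognormal around a linear predictor. The coefficients, noise level, sample size, and seed are all made up for illustration and are not taken from any real insurance data.

```python
import math
import random

def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return (sum((v - m) ** 3 for v in values) / n) / var ** 1.5

def residual_skewness(x, y):
    """Skewness of the residuals from a straight-line fit of y on x."""
    b0, b1 = fit_line(x, y)
    return skewness([yi - (b0 + b1 * xi) for xi, yi in zip(x, y)])

# Hypothetical data: log(severity) is linear in x with normal errors.
rng = random.Random(1)
x = [rng.uniform(0, 1) for _ in range(5000)]
severity = [math.exp(7 + 0.5 * xi + rng.gauss(0, 0.8)) for xi in x]

skew_raw = residual_skewness(x, severity)
skew_log = residual_skewness(x, [math.log(s) for s in severity])
print("residual skewness, raw scale:", round(skew_raw, 2))
print("residual skewness, log scale:", round(skew_log, 2))
```

Under these assumptions the raw-scale residuals are strongly right-skewed, while the log-scale residuals are close to symmetric. Skewness is only one rough symptom of non-normality; a normal probability plot of the residuals would be the more usual check.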

Art

---------
On Sun, 2 Jul 2006 22:00:17 -0400, Paul Walker <walker.627@OSU.EDU> wrote:

>When doing predictive modeling using multiple regression (say, linear or
>logistic) it is common practice to transform the X variables. Suppose
>your model is Y = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk + error. To
>improve the "fit" of the model, sometimes modelers try to transform some
>of the X1, X2, ..., Xk variables by looking at the univariate plots and
>trying to find transformations that create a linear relationship between
>the X variable and the Y variable (or in the case of logistic regression
>linear in the log-odds). In either case, my experience tells me that this
>practice is flawed for the following reason. We may observe that there is
>a nonlinear relationship between X1 and Y on a univariate level, but this
>can be entirely explained by the relationship between X1 and one
>of the other X variables, say X2. There may be a nonlinear relationship
>between X1 and X2 that is causing the observed pattern between X1 and Y.
>I have found that even when I fit a main effects model, I can usually
>capture nonlinear trends in my X variables. I can tell this because I
>project both the scores and the actual values onto the levels of each X
>variable, i.e. X1, X2, ..., Xk, and show that the scores and the actual
>values are very close.
>
>So, my main conclusion is that it doesn't make any sense to transform
>variables by looking at univariate plots or maximizing some univariate
>statistic between X and Y, such as R-square. Could anyone on the list
>comment on this conclusion? Does it make sense? I would like to prove
>this result analytically or by simulation and would like some suggestions
>how to approach it.
>
>My solution to this problem is to use principal components regression.
>Principal components are by definition uncorrelated [actually independent]
>so making transformations based on univariate plots or statistics makes
>sense. The component scores could be transformed to create a more linear
>relationship with the target variable. Call the principal component
>scores Z1, Z2, ..., Zk. Now the model fit might look like this:
>
> Y = b0 + b1*f(Z1) + b2*f(Z2) + b3*f(Z3) + ... + bk*f(Zk) + error
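One way to approach the simulation Paul asks about is sketched below in Python (the thread contains no code). It is only an illustration under made-up assumptions: Y is truly linear in X1 and X2, while X2 depends on X1 quadratically. The helper names (`ols`, `residuals`, `corr`), the coefficients, the sample size, and the seed are all hypothetical. The sketch checks whether the curvature seen in a univariate fit of Y on X1 disappears once X2 enters a main effects model.

```python
import random

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    return [A[r][p] / A[r][r] for r in range(p)]

def residuals(columns, y, beta):
    """Observed y minus fitted values for the given coefficients."""
    return [y[i] - beta[0] - sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i in range(len(y))]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

# Hypothetical data: Y is linear in X1 and X2, but X2 is nonlinear in X1.
rng = random.Random(0)
n = 2000
x1 = [rng.uniform(-2, 2) for _ in range(n)]
x2 = [v * v + rng.gauss(0, 0.5) for v in x1]
y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
x1_sq = [v * v for v in x1]

beta_uni = ols([x1], y)        # univariate fit: Y on X1 only
beta_full = ols([x1, x2], y)   # main effects fit: Y on X1 and X2

# Leftover curvature in X1: correlation of the residuals with X1 squared.
r_uni = corr(residuals([x1], y, beta_uni), x1_sq)
r_full = corr(residuals([x1, x2], y, beta_full), x1_sq)
print("main effects coefficients:", [round(b, 2) for b in beta_full])
print("residual curvature, Y ~ X1:", round(r_uni, 2))
print("residual curvature, Y ~ X1 + X2:", round(r_full, 2))
```

In this setup the univariate residuals are strongly correlated with X1 squared, which a univariate plot would suggest fixing with a quadratic transformation of X1. The main effects residuals show almost no such curvature. This supports the observation only for this one constructed case. It does not show that univariate transformations are never useful, and it does not test the principal components proposal.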

