```
Date:     Mon, 3 Jul 2006 04:14:32 +0000
Reply-To: toby dunn
Sender:   "SAS(r) Discussion"
From:     toby dunn
Subject:  Re: Transformations of correlated X variables
Comments: To: art297@NETSCAPE.NET
In-Reply-To: <200607030331.k62Akn6a019160@malibu.cc.uga.edu>
Content-Type: text/plain; format=flowed

I remember, way back when, a paper that discussed a multiple-misspecification
test. It was grounded in the fact that there are many reasons why one
misspecifies a model, and that those reasons can be interrelated. So the
author came up with a way to test a number of them more or less
simultaneously. Now if I could just remember what that paper was called.

Toby Dunn

From: Arthur Tabachneck
Reply-To: Arthur Tabachneck
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Transformations of correlated X variables
Date: Sun, 2 Jul 2006 23:31:43 -0400

Paul,

I'll leave the discussion of the merits of your hypothesis to the list's
stat experts. However, I do recall being taught that one of the assumptions
that must be met when using linear regression methods is that the residuals
be normally distributed, and that if one has reason to believe a particular
variable inherently has a non-normal distribution (e.g., log or square
root), then a normal distribution of the errors might be obtained by
transforming the data accordingly. With insurance data I find that to be
the case when modelling such variables as claim frequency and severity.

Art

---------

On Sun, 2 Jul 2006 22:00:17 -0400, Paul Walker wrote:

>When doing predictive modeling using multiple regression (say, linear or
>logistic), it is common practice to transform the X variables. Suppose
>your model is Y = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk + error. To
>improve the "fit" of the model, modelers sometimes try to transform some
>of the X1, X2, ..., Xk variables by looking at univariate plots and
>trying to find transformations that create a linear relationship between
>the X variable and the Y variable (or, in the case of logistic
>regression, linearity in the log odds). In either case, my experience
>tells me that this practice is flawed for the following reason. We may
>observe a nonlinear relationship between X1 and Y at the univariate
>level, but this can be entirely explained by the relationship between X1
>and one of the other X variables, say X2. There may be a nonlinear
>relationship between X1 and X2 that is causing the observed pattern
>between X1 and Y. I have found that even when I fit a main-effects
>model, I can usually capture nonlinear trends in my X variables. I can
>tell this because I project both the scores and the actual values onto
>the levels of each X variable, i.e. X1, X2, ..., Xk, and show that the
>scores and the actual values are very close.
>
>So, my main conclusion is that it doesn't make sense to transform
>variables by looking at univariate plots or by maximizing some
>univariate statistic between X and Y, such as R-square. Could anyone on
>the list comment on this conclusion? Does it make sense? I would like to
>prove this result analytically or by simulation, and would like some
>suggestions on how to approach it.
>
>My solution to this problem is to use principal components regression.
>Principal components are by definition uncorrelated [actually
>independent], so making transformations based on univariate plots or
>statistics makes sense. The component scores could be transformed to
>create a more linear relationship with the target variable. Call the
>principal component scores Z1, Z2, ..., Zk. Now the model fit might look
>like this:
> Y = b0 + b1*f(Z1) + b2*f(Z2) + b3*f(Z3) + ... + bk*f(Zk) + error
```
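The effect Paul describes can be checked by the simulation he asks for. The following is not from the thread — it is a minimal NumPy sketch (all variable names hypothetical) of one scenario consistent with his claim: Y is truly linear in X1 and X2, but X2 is a nonlinear function of X1, so Y plotted against X1 alone looks quadratic. No transformation is needed — the plain main-effects model on both predictors recovers the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# X2 is nonlinearly related to X1, so Y vs. X1 alone appears curved
x1 = rng.normal(size=n)
x2 = x1**2 + rng.normal(scale=0.5, size=n)

# Y is genuinely linear in X1 and X2 (no transformation needed)
y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

def r2(predictors, y):
    """R-square of an OLS fit with intercept on the given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

print("R^2, Y ~ X1 alone       :", round(r2([x1], y), 3))
print("R^2, Y ~ X1 + X2 (linear):", round(r2([x1, x2], y), 3))
```

The univariate plot of Y against X1 would suggest adding an X1-squared term, yet the untransformed main-effects model on X1 and X2 already explains nearly all the explainable variance — the apparent nonlinearity was carried entirely by the X1–X2 relationship.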
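Paul's proposed remedy can be sketched the same way. Again this is not from the thread, and the names are hypothetical — a minimal NumPy illustration of principal components regression: center the predictors, take the eigenvectors of their covariance matrix, project to get component scores Z1, ..., Zk, and confirm the scores are mutually uncorrelated before fitting on them (which is what licenses choosing each univariate transformation f(Zj) separately).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# three correlated predictors
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = -0.5 * x1 + rng.normal(scale=0.8, size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 - 0.5 * x2 + 0.25 * x3 + rng.normal(scale=0.3, size=n)

# principal component scores: center, then project onto covariance eigenvectors
Xc = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigval)[::-1]      # largest-variance component first
Z = Xc @ eigvec[:, order]             # columns are Z1, Z2, Z3

# the scores are uncorrelated, so a univariate transformation of each Zj
# can be chosen on its own plot without the confounding seen among the Xs
corr = np.corrcoef(Z, rowvar=False)
off_diag = corr[~np.eye(Z.shape[1], dtype=bool)]
print("max |off-diagonal score correlation|:", abs(off_diag).max())

# main-effects fit on the (possibly transformed) scores
A = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```

One caveat worth keeping in mind, per Paul's bracketed aside: the scores are uncorrelated by construction, but they are independent only under stronger assumptions such as multivariate normality of the predictors.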
