Date: Mon, 3 Jul 2006 04:14:32 +0000
Reply-To: toby dunn <tobydunn@HOTMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: toby dunn <tobydunn@HOTMAIL.COM>
Subject: Re: Transformations of correlated X variables
In-Reply-To: <200607030331.k62Akn6a019160@malibu.cc.uga.edu>
I remember a paper from way back that discussed a test for multiple
misspecifications. It was grounded in the fact that there are many reasons
why one misspecifies a model, and they can be interrelated. So the author
came up with a way to test, more or less, a bunch of things at the same time.
Now if I could just remember what that paper was called.
Toby Dunn
From: Arthur Tabachneck <art297@NETSCAPE.NET>
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Transformations of correlated X variables
Date: Sun, 2 Jul 2006 23:31:43 -0400
Paul,
I'll leave the discussion of the merits of your hypothesis to the list's
stat experts.
However, I do recall being taught that one of the assumptions that must
be met when using linear regression methods is that the residuals are
normally distributed and, if one has reason to believe that a particular
variable inherently has a non-normal distribution (e.g., one that a log or
square-root transformation would normalize), then normally distributed
errors might be obtained by transforming the data accordingly.
With insurance data I find that to be the case when modelling such
variables as claim frequency and severity.
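A minimal sketch of that idea, using made-up data rather than real insurance data: the "severity" values below are simulated under the assumption of a multiplicative (log-normal) error, and the coefficients, sample size, and seed are arbitrary. It fits a straight line to the raw response and to its log, and compares the skewness of the two sets of residuals.

```python
import math
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def skewness(v):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(v)
    m = sum(v) / n
    s2 = sum((vi - m) ** 2 for vi in v) / n
    s3 = sum((vi - m) ** 3 for vi in v) / n
    return s3 / s2 ** 1.5

rng = random.Random(2006)  # fixed seed so the run is repeatable
x = [rng.uniform(0, 2) for _ in range(5000)]

# ASSUMPTION: hypothetical severity with log-normal error, not real claims data
severity = [math.exp(1.0 + 0.5 * xi + rng.gauss(0, 0.8)) for xi in x]

skew_raw = skewness(residuals(x, severity))
skew_log = skewness(residuals(x, [math.log(s) for s in severity]))

print(f"residual skewness, raw severity: {skew_raw:.2f}")
print(f"residual skewness, log severity: {skew_log:.2f}")
```

Under these assumptions the residuals from the raw fit are strongly right-skewed, while those from the log fit are close to symmetric. Skewness is only one symptom of non-normality, so this is a rough check, not a formal test.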
Art
---------
On Sun, 2 Jul 2006 22:00:17 -0400, Paul Walker <walker.627@OSU.EDU> wrote:
>When doing predictive modeling using multiple regression (say, linear or
>logistic) it is common practice to transform the X variables. Suppose
>your model is Y = b0 + b1*X1 + b2*X2 + b3*X3 + ...+ bk*Xk + error. To
>improve the "fit" of the model, modelers sometimes try to transform some
>of the X1, X2, ..., Xk variables by looking at the univariate plots and
>trying to find transformations that create a linear relationship between
>the X variable and the Y variable (or, in the case of logistic regression,
>linear in the log odds). In either case, my experience tells me that this
>practice is flawed for the following reason. We may observe that there is
>a nonlinear relationship between X1 and Y on a univariate level, but this
>can be entirely explained because of the relationship between X1 and one
>of the other X variables, say X2. There may be a non-linear relationship
>between X1 and X2 that is causing the observed pattern between X1 and Y.
>I have found that even when I fit a main effects model, I can usually
>capture nonlinear trends in my X variables. I can tell this because I
>project both the scores and the actual values onto the levels of each X
>variable, i.e. X1, X2, ..., Xk, and show that the scores and the actual
>values are very close.
>
>So, my main conclusion is that it doesn't make any sense to transform
>variables by looking at univariate plots or maximizing some univariate
>statistic between X and Y, such as R-square. Could anyone on the list
>comment on this conclusion? Does it make sense? I would like to prove
>this result analytically or by simulation and would like some suggestions
>how to approach it.
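One way to approach this by simulation, sketched here under assumptions of my own choosing (the coefficients, noise levels, sample size, and the quadratic link between X1 and X2 are all hypothetical): generate Y from a model that is exactly linear in X1 and X2, let X2 depend nonlinearly on X1, and compare R-square across a few fits. The `ols_r2` helper is a small illustrative function, not a library routine.

```python
import random

def ols_r2(columns, y):
    """Fit y on an intercept plus the given columns; return R-squared."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * xv for b, xv in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst

rng = random.Random(627)  # fixed seed so the run is repeatable
n = 2000
x1 = [rng.uniform(-2, 2) for _ in range(n)]
# ASSUMPTION: X2 is a noisy quadratic function of X1
x2 = [a * a + rng.gauss(0, 0.5) for a in x1]
# The true model is linear in both X1 and X2
y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
x1_sq = [a * a for a in x1]

r2_lin = ols_r2([x1], y)                   # Y on X1 alone
r2_quad = ols_r2([x1, x1_sq], y)           # Y on X1 and X1^2
r2_main = ols_r2([x1, x2], y)              # main-effects model
r2_main_sq = ols_r2([x1, x2, x1_sq], y)    # main effects plus X1^2

print(f"Y ~ X1            R2 = {r2_lin:.3f}")
print(f"Y ~ X1 + X1^2     R2 = {r2_quad:.3f}")
print(f"Y ~ X1 + X2       R2 = {r2_main:.3f}")
print(f"Y ~ X1 + X2 + X1^2 R2 = {r2_main_sq:.3f}")
```

In this constructed case the univariate view suggests that X1 needs a squared term, yet adding X1^2 to the main-effects model buys almost nothing. That supports the conclusion only for this one setup; varying the link between X1 and X2 and the true model for Y would be needed before generalizing.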
>
>My solution to this problem is to use principal components regression.
>Principal components are by definition uncorrelated [not necessarily independent]
>so making transformations based on univariate plots or statistics makes
>sense. The component scores could be transformed to create a more linear
>relationship with the target variable. Call the principal component
>scores Z1, Z2, ..., Zk. Now the model fit might look like this:
> Y = b0 + b1*f(Z1) + b2*f(Z2) + b3*f(Z3) + ... + bk*f(Zk) + error
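A minimal sketch of the component-score step, assuming just two simulated, correlated X variables (the data, seed, and variable names are hypothetical). With two variables the principal axes are a plain rotation, so no eigenvalue routine is needed. The scores come out uncorrelated by construction, which is weaker than independent: the two coincide only in special cases such as jointly normal data.

```python
import math
import random

def center(v):
    """Subtract the mean from every value."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cov(a, b):
    """Sample covariance; a and b are assumed to be centered already."""
    return sum(ai * bi for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(627)  # fixed seed so the run is repeatable
x1 = [rng.gauss(0, 1) for _ in range(2000)]
# ASSUMPTION: X2 is built to be correlated with X1
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]

c1, c2 = center(x1), center(x2)
s11, s22, s12 = cov(c1, c1), cov(c2, c2), cov(c1, c2)

# For a 2x2 covariance matrix the principal axes are a rotation by theta
theta = 0.5 * math.atan2(2 * s12, s11 - s22)
z1 = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(c1, c2)]
z2 = [-math.sin(theta) * a + math.cos(theta) * b for a, b in zip(c1, c2)]

var_z1, var_z2 = cov(z1, z1), cov(z2, z2)
corr_x = s12 / math.sqrt(s11 * s22)
corr_z = cov(z1, z2) / math.sqrt(var_z1 * var_z2)

print(f"corr(X1, X2) = {corr_x:.3f}")
print(f"corr(Z1, Z2) = {corr_z:.3e}")
print(f"var(Z1) = {var_z1:.3f}, var(Z2) = {var_z2:.3f}")
```

The scores Z1 and Z2 could then be transformed and used as regressors in place of X1 and X2, as in the model above. With more than two variables an eigen-decomposition of the covariance or correlation matrix would replace the rotation.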