Date: Fri, 21 Mar 2008 15:12:20 -0500
Reply-To: Mary <mlhoward@avalon.net>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mary <mlhoward@AVALON.NET>
Subject: Re: PROC GLMSELECT?
Content-Type: text/plain; charset="iso-8859-1"
I went out and looked at some of the articles Peter Flom mentioned, and while they are very mathematical, Roth's article seemed to indicate that a LASSO method could help in genetic analysis where we have thousands of gene combinations and very few events (people who get the disease or not).
The biggest problem I'm having is in gene reduction: I can fit single-variable models, each with one gene observation as the independent variable and disease status (gets the disease or not) as the dependent variable, and look at the ones that have the best odds ratios and likelihoods, but that still leaves me with at least 60 variables that are good possibilities for my model out of the original thousands.
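The single-variable screening step described above can be sketched as follows. This is a minimal illustration in Python rather than SAS; the marker names and 2x2 counts are invented for the example, and the helper names `odds_ratio` and `screen` are hypothetical. It only shows the arithmetic of ranking markers by how far their odds ratio is from 1.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = carriers with disease,     b = carriers without disease
    c = non-carriers with disease, d = non-carriers without disease
    If any cell is zero, 0.5 is added to every cell (Haldane-Anscombe
    correction) so the ratio stays finite.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)


def screen(tables):
    """Rank markers by |log(odds ratio)|, strongest association first."""
    scored = [(name, odds_ratio(*cells)) for name, cells in tables.items()]
    return sorted(scored, key=lambda item: abs(math.log(item[1])), reverse=True)


# Hypothetical counts, invented for illustration only.
tables = {
    "marker_1": (12, 30, 8, 150),
    "marker_2": (10, 90, 10, 90),
    "marker_3": (2, 60, 18, 120),
}

ranked = screen(tables)
for name, value in ranked:
    print(f"{name}: OR = {value:.2f}")
```

A ranking like this looks at each marker in isolation, so it says nothing about whether two highly ranked markers carry the same information.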
Reducing it further has been a problem, since apparently DNA repeats itself in places, and thus a variable that is highly significant on its own may not provide additional information once another variable (or set of variables) is already in the model.
Would GLMSELECT be an appropriate approach to model building with my binary outcome (has the disease or not), provided it was followed by verifying the model in PROC LOGISTIC?
An example would be appreciated.
Thanks very much for the discussion - it is very interesting, and perhaps very important for building genetic models.
-Mary
----- Original Message -----
From: Sigurd Hermansen
To: SAS-L@LISTSERV.UGA.EDU
Sent: Friday, March 21, 2008 2:34 PM
Subject: Re: PROC GLMSELECT?
I understand the concern about extensions of GLMSELECT to generalized
linear models. The heuristic justification for using GLMSELECT evolves
this way:
1) context knowledge and theory should in any event guide specification
searches;
2) maximizing any single criterion function can be misleading;
3) tests of using GLMSELECT to select a priori related predictors or to
reject unrelated predictors show some promise;
4) elimination of unrelated parameters that might be selected by other
step-wise methods has a benefit;
5) other methods (classification trees, graphics) can be used to search
for important predictors that GLMSELECT might miss;
6) GLMSELECT does not conduct a test of a hypothesis about a model, and
should not be interpreted as such;
7) exploratory methods do not depend as critically on distribution and
continuity assumptions.
S
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Wensui Liu
Sent: Friday, March 21, 2008 2:35 PM
To: Peter Flom
Cc: SAS-L@listserv.uga.edu
Subject: Re: PROC GLMSELECT?
first of all, the idea of lasso is similar to ridge regression and is
based upon shrinking the estimated coefficients, suppressing the
unimportant ones to zero. as we all know, the link functions for OLS
reg and Logit reg are different. identity is a linear link, while logit
is not, meaning the ways to estimate coefficients are different in the
two types of models. if coefficients from OLS reg are not comparable to
ones from Logit reg, how would their shrinkage process be comparable
and exchangeable?
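To make the shrinkage comparison concrete, here is a small sketch in Python (not SAS). It assumes the textbook special case of a least-squares fit with orthonormal predictors, where the lasso estimate is the soft-thresholded OLS coefficient and the ridge estimate is the OLS coefficient scaled by 1/(1 + lambda). The coefficient values and lambda are invented for illustration, and nothing here describes how PROC GLMSELECT computes its estimates.

```python
def soft_threshold(beta, lam):
    """Lasso shrinkage of one OLS coefficient (orthonormal design assumed)."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # coefficients with |beta| <= lam are set exactly to zero


def ridge_shrink(beta, lam):
    """Ridge shrinkage of one OLS coefficient (orthonormal design assumed)."""
    return beta / (1.0 + lam)


# Hypothetical OLS coefficients and penalty, for illustration only.
ols = [2.0, -1.25, 0.25, -0.125]
lam = 0.5

lasso = [soft_threshold(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]

for b, l, r in zip(ols, lasso, ridge):
    print(f"OLS {b:+.3f}  lasso {l:+.3f}  ridge {r:+.3f}")
```

The small coefficients drop to exactly zero under the lasso rule but only shrink under ridge. With a logit link there is no such closed form; the penalized likelihood has to be maximized iteratively, which is the difference in estimation raised above.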
On Fri, Mar 21, 2008 at 2:13 PM, Peter Flom
<peterflomconsulting@mindspring.com> wrote:
> Wensui Liu <liuwensui@gmail.com> wrote
>
> >well, peter,
> >the references listed are all referring to using lasso in generalized
> >linear models. however, they never mentioned anything about using proc
> >glmselect in generalized linear models. these are 2 totally different
> >concepts. per my limited understanding about glmselect, the GLM here
> >means general linear models instead of generalized linear models.
> >please correct me if i am wrong.
> >thx.
> >
>
> Well, not *totally* different concepts.
>
> You are right that GLMSELECT is designed to work with continuous DVs.
> But I don't see anything that STOPS it from working with binary DVs,
> or survival times.
>
> GLMSELECT, after all, is also not intended to be a final step in any
> analysis: It's designed to be a variable selection tool, to be
> followed up with use of PROC GLM or PROC REG. I see no reason why it
> can't be used as a variable selection tool, only followed with PROC
> LOGISTIC or PROC PHREG; PROC MIXED and similar would be more complex,
> I think. The reason I think this is because of what assumptions are
> violated by each technique.
>
> Binary and survival DVs violate assumptions about the distribution of
> the residuals. Mixed models violate assumptions about their
> independence. That seems to me to be a much more complex problem.
>
> Peter
>
> Statistical Consultant
> www DOT peterflom DOT com
>
--
===============================
WenSui Liu
ChoicePoint Precision Marketing
Phone: 678-893-9457
Email : wensui.liu@choicepoint.com
Blog : statcompute.spaces.live.com
===============================