LISTSERV at the University of Georgia
Date:         Fri, 21 Mar 2008 17:08:20 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: PROC GLMSELECT?
Comments: To: Mary <mlhoward@avalon.net>
In-Reply-To:  <01ba01c88b8f$e316de10$c12fa8c0@HP82083701405>
Content-Type: text/plain; charset="us-ascii"

Mary: Once you exclude a critical variable using any specification search method, verifying or validating with a parametric modelling method may hint at gaps in a model (elevated residuals for observations across specific regions of genes, for example), but evaluations of alternative models generally focus on a limited set of covariates. I'd supplement GLMSELECT with a classification tree (CART, CHAID, TreeNet) program during the specification search phase and take a close look at prediction errors (concordance c statistic) and their distributions across bootstrap samples. An omitted covariate that appears toward the top of a decision tree, or a somewhat low c statistic (<0.8), could mean that GLMSELECT or whatever one is using to reduce model complexity has excluded something important.
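To make the c statistic check concrete, here is a minimal sketch in plain Python (not SAS). It computes the concordance statistic by comparing every event/non-event pair and then looks at its spread across bootstrap resamples. The data, the function names `c_statistic` and `bootstrap_c`, and the seed are all hypothetical, and the sketch resamples fixed predictions for brevity; a fuller check would refit the model on each bootstrap sample.

```python
import random

def c_statistic(outcomes, scores):
    """Concordance (c) statistic: share of event/non-event pairs in which
    the event has the higher score; tied scores count as half."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    non_events = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not non_events:
        return None  # undefined unless both classes are present
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

def bootstrap_c(outcomes, scores, n_boot=200, seed=2008):
    """c statistic on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    results = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = c_statistic([outcomes[i] for i in idx], [scores[i] for i in idx])
        if c is not None:  # skip resamples that contain only one class
            results.append(c)
    return results

# Hypothetical outcomes (1 = disease) and predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 0, 1]
p = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1, 0.3, 0.6]
cs = sorted(bootstrap_c(y, p))
print("full-sample c:", c_statistic(y, p))
print("bootstrap range:", cs[0], "to", cs[-1])
```

A wide bootstrap range, or a range that dips well below 0.8, would be one sign that the reduced model is less stable than the full-sample figure suggests.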

How you reduce or add model complexity also matters. Simply adding dummy variables to a model to minimize prediction errors, for instance, will lead to a model that fits a sample well but has far less predictive or explanatory value than model diagnostics suggest. Finding surrogates for dummy restrictions using step-wise methods will do much the same. Control of the False Discovery Rate (FDR) requires that one take into account multiple comparisons of covariates to outcomes, whether through step-wise selection or other methods. S
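One common way to account for multiple comparisons when screening many covariates one at a time is the Benjamini-Hochberg step-up procedure. The sketch below is plain Python, not SAS, and is only an illustration of the idea: the p-values are made up, and `benjamini_hochberg` is a hypothetical helper name. It is one option for FDR control, not necessarily the method intended in the message above.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests rejected by the Benjamini-Hochberg step-up
    procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is at most q * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from single-variable screens of ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
kept = benjamini_hochberg(p, q=0.05)
naive = [i for i, v in enumerate(p) if v <= 0.05]
print("BH keeps:", kept)
print("unadjusted 0.05 keeps:", naive)
```

With these made-up values the unadjusted 0.05 cutoff keeps five covariates while the adjusted procedure keeps two, which illustrates how much of a one-at-a-time screen can be attributable to the number of comparisons.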

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Mary
Sent: Friday, March 21, 2008 4:12 PM
To: Sigurd Hermansen; SAS-L@LISTSERV.UGA.EDU
Subject: Re: Re: PROC GLMSELECT?

I went out and looked at some of the articles Peter Flom mentioned, and while they are very mathematical, Roth's article seemed to indicate that a LASSO method could help in genetic analysis where we have thousands of gene combinations and very few events (people who get the disease or not).

The biggest problem I'm having is in gene reduction: I can run each gene variable on its own as the independent variable against the dependent variable of whether they get the disease or not, and look at the ones that have the best odds ratios and likelihoods, but then I'm still down to at least 60 variables that are good possibilities in my model out of the original thousands.

Reducing it further has been a problem, since apparently DNA repeats itself in places, and thus one variable that is highly significant may not provide additional information once another variable or variables is already in the model.

Would GLMSELECT be an appropriate approach to model building with my binary outcome (has the disease or not), provided it was followed by verifying the model in PROC Logistic?

An example would be appreciated.

Thanks very much for the discussion - it is very interesting, and perhaps very important for building genetic models.

-Mary

----- Original Message -----
From: Sigurd Hermansen
To: SAS-L@LISTSERV.UGA.EDU
Sent: Friday, March 21, 2008 2:34 PM
Subject: Re: PROC GLMSELECT?

I understand the concern about extensions of GLMSELECT to generalized linear models. The heuristic justification for using GLMSELECT evolves this way:

1) context knowledge and theory should in any event guide specification searches;
2) maximizing any single criterion function can be misleading;
3) tests of using GLMSELECT to select a priori related predictors or to reject unrelated predictors show some promise;
4) elimination of unrelated parameters that might be selected by other step-wise methods has a benefit;
5) other methods (classification trees, graphics) can be used to search for important predictors that GLMSELECT might miss;
6) GLMSELECT does not conduct a test of a hypothesis about a model, and should not be interpreted as such;
7) exploratory methods do not depend as critically on distribution and continuity assumptions.

S

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Wensui Liu
Sent: Friday, March 21, 2008 2:35 PM
To: Peter Flom
Cc: SAS-L@listserv.uga.edu
Subject: Re: PROC GLMSELECT?

first of all, the idea of lasso is similar to ridge regression and is based upon the shrinkage process of estimated coefficients by suppressing the unimportant ones to zero. as we all know, the link functions for OLS reg and Logit reg are different. identity is a linear link, while logit is not, meaning the ways to estimate coefficients are different in two types of models. if coefficients from OLS reg are not comparable to ones from Logit reg, how would their shrinkage process be comparable and exchangeable?
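As a small illustration of the shrinkage being discussed, the sketch below (plain Python, not SAS) contrasts lasso and ridge in the one case where both have simple closed forms: squared-error loss with orthonormal predictors. The coefficient values, the penalty `lam`, and the function names are hypothetical. These closed forms belong to the identity-link case only; with a logit link the penalized likelihood has to be solved iteratively, which is part of the concern raised above.

```python
def lasso_shrink(b_ols, lam):
    """Soft-thresholding: the lasso solution for one coefficient when the
    predictors are orthonormal and the objective is
    0.5 * RSS + lam * sum(|b|)."""
    magnitude = abs(b_ols) - lam
    if magnitude <= 0:
        return 0.0  # small coefficients are set exactly to zero
    return magnitude if b_ols > 0 else -magnitude

def ridge_shrink(b_ols, lam):
    """Ridge solution under the same orthonormal design, with objective
    0.5 * RSS + 0.5 * lam * sum(b ** 2)."""
    return b_ols / (1.0 + lam)  # shrinks proportionally, never to zero

ols = [3.0, -1.5, 0.4, -0.2]  # hypothetical OLS coefficients
lam = 0.5
print("lasso:", [lasso_shrink(b, lam) for b in ols])
print("ridge:", [round(ridge_shrink(b, lam), 3) for b in ols])
```

In this setting the lasso drops the two smallest coefficients entirely, while ridge keeps all four and only scales them down, which is why lasso doubles as a variable selection tool.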

On Fri, Mar 21, 2008 at 2:13 PM, Peter Flom <peterflomconsulting@mindspring.com> wrote:
> Wensui Liu <liuwensui@gmail.com> wrote
>
> >well, peter,
> >the references listed are all referring to using lasso in generalized
> >linear models. however, they never mentioned anything about using proc
> >glmselect in generalized linear models. these are 2 totally different
> >concepts. per my limited understanding about glmselect, the GLM here
> >means general linear models instead of generalized linear models.
> >please correct me if i am wrong.
> >thx.
>
> Well, not *totally* different concepts.
>
> You are right that GLMSELECT is designed to work with continuous DVs.
> But I don't see anything that STOPS it from working with binary DVs,
> or survival times.
>
> GLMSELECT, after all, is also not intended to be a final step in any
> analysis: It's designed to be a variable selection tool, to be
> followed up with use of PROC GLM or PROC REG. I see no reason why it
> can't be used as a variable selection tool, only followed with PROC
> LOGISTIC or PROC PHREG; PROC MIXED and similar would be more complex,
> I think. The reason I think this is because of what assumptions are
> violated by each technique.
>
> Binary and survival DVs violate assumptions about the distribution of
> the residuals. Mixed models violate assumptions about their
> independence. That seems to me to be a much more complex problem.
>
> Peter
>
> Statistical Consultant
> www DOT peterflom DOT com

--
===============================
WenSui Liu
ChoicePoint Precision Marketing
Phone: 678-893-9457
Email : wensui.liu@choicepoint.com
Blog : statcompute.spaces.live.com
===============================

