Date: Fri, 21 Mar 2008 17:08:20 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: PROC GLMSELECT?
Once you exclude a critical variable using any specification search
method, verifying or validating using a parametric modelling method may
hint at gaps in a model (for example, elevated residuals for observations
across specific regions of genes), but evaluations of
alternative models generally focus on a limited set of covariates. I'd
supplement GLMSELECT with a classification tree (CART, CHAID, TreeNet)
program during the specification search phase and take a close look at
prediction errors (concordance c statistic) and their distributions
across bootstrap samples. An omitted covariate that appears toward the
top of a decision tree or a somewhat low c statistic (<0.8) could mean
that GLMSELECT or whatever one is using to reduce model complexity has
excluded something important.
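
A minimal sketch of the check described above, assuming a binary outcome and predicted probabilities that have already been produced by some fitted model. The data and the function names c_statistic and bootstrap_c are hypothetical, and the bootstrap here only rescores fixed predictions; a fuller check would repeat the selection and fit within each resample.

```python
import random

def c_statistic(outcomes, scores):
    """Concordance (c) statistic: share of event/non-event pairs in which
    the event has the higher predicted score; ties count as one half."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    nonevents = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))

def bootstrap_c(outcomes, scores, n_boot=200, seed=0):
    """c statistic recomputed on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    results = []
    while len(results) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            results.append(c_statistic(ys, [scores[i] for i in idx]))
    return sorted(results)

# Hypothetical outcomes (1 = event) and predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 0, 1]
p = [0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1, 0.6]
c = c_statistic(y, p)
dist = bootstrap_c(y, p)
print("c =", round(c, 4))
print("bootstrap range:", round(dist[0], 4), "to", round(dist[-1], 4))
```

A wide spread of c across resamples, or a value well under 0.8, would be the signal to go back and look for an excluded covariate.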
How you reduce or add model complexity also matters. Simply adding dummy
variables to a model to minimize prediction errors, for instance, will
lead to a model that fits a sample well but has far less predictive or
explanatory value than model diagnostics suggest. Finding surrogates for
dummy restrictions using step-wise methods will do much the same.
Control of the False Discovery Rate (FDR) requires that one take into
account multiple comparisons of covariates to outcomes, whether through
step-wise selection or other methods.
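
To make the multiple-comparisons point concrete, here is a sketch of one common FDR procedure, the Benjamini-Hochberg step-up rule. The message above does not name a procedure, so the choice is an assumption, as are the p-values, which are invented for illustration. The rule is known to control the FDR when the tests are independent or positively dependent.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q
    (Benjamini-Hochberg step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from single-covariate screens.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)
naive = [i for i, pv in enumerate(pvals) if pv <= 0.05]
print("BH rejections:", rejected)
print("unadjusted rejections at 0.05:", naive)
```

With these numbers the unadjusted screen flags five covariates while the adjusted one keeps two, which is the gap that step-wise selection tends to hide.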
From: firstname.lastname@example.org [mailto:email@example.com]
On Behalf Of Mary
Sent: Friday, March 21, 2008 4:12 PM
To: Sigurd Hermansen; SAS-L@LISTSERV.UGA.EDU
Subject: Re: Re: PROC GLMSELECT?
I went out and looked at some of the articles Peter Flom mentioned, and
while they are very mathematical, Roth's article seemed to indicate
that a LASSO method could help in genetic analysis where we have
thousands of gene combinations and very few events (people who get the
disease or not).
The biggest problem I'm having is in gene reduction- I can run single
variables of my gene observation as the independent variable against the
dependent variable of whether they get the disease or not, and look at
the ones that have the best odds ratios and likelihoods, but then I'm
still down to at least 60 variables that are good possibilities in my
model out of the original thousands.
Reducing it further has been a problem, since apparently DNA repeats
itself in places, and thus one variable that is highly significant may
not provide additional information once another variable or variables is
already in the model.
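
A rough sketch of the screening problem described above, assuming binary markers and a binary disease indicator. The data, the 0.9 redundancy cutoff, and the function names are all hypothetical. It ranks markers by a single-variable odds ratio and then skips any marker that nearly duplicates one already kept. Ranking by odds ratio alone ignores protective effects and sampling error, so this illustrates the redundancy issue and is not a recommended selection method.

```python
from math import sqrt

def two_by_two(x, y):
    """Cell counts for x present/absent by y present/absent."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def odds_ratio(marker, disease):
    """Odds ratio with 0.5 added to every cell so that empty cells,
    common when events are few, do not give zero or infinite values."""
    a, b, c, d = (n + 0.5 for n in two_by_two(marker, disease))
    return (a * d) / (b * c)

def phi(x, y):
    """Phi coefficient (Pearson correlation for two binary variables)."""
    a, b, c, d = two_by_two(x, y)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# Hypothetical data: g2 repeats g1 exactly, g3 is unrelated to g1.
disease = [1, 1, 1, 0, 0, 0, 0, 0]
markers = {
    "g1": [1, 1, 1, 0, 0, 0, 0, 1],
    "g2": [1, 1, 1, 0, 0, 0, 0, 1],
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],
}

ranked = sorted(markers, key=lambda k: odds_ratio(markers[k], disease),
                reverse=True)
kept = []
for name in ranked:
    # Keep a marker only if it is not nearly identical to one already kept.
    if all(abs(phi(markers[name], markers[k])) < 0.9 for k in kept):
        kept.append(name)
print(kept)
```

Here g2 looks as strong as g1 on its own but adds nothing once g1 is kept, which is the situation with repeated DNA regions.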
Would GLMSELECT be an appropriate approach to model building with my
binary outcome (has the disease or not), provided it was followed by
verifying the model in PROC Logistic?
An example would be appreciated.
Thanks very much for the discussion - it is very interesting, and
perhaps very important for building genetic models.
----- Original Message -----
From: Sigurd Hermansen
Sent: Friday, March 21, 2008 2:34 PM
Subject: Re: PROC GLMSELECT?
I understand the concern about extensions of GLMSELECT to generalized
linear models. The heuristic justification for using GLMSELECT evolves
1) context knowledge and theory should in any event guide
2) maximizing any single criterion function can be misleading;
3) tests of using GLMSELECT to select a priori related predictors or
reject unrelated predictors show some promise;
4) elimination of unrelated parameters that might be selected by other
step-wise methods has a benefit;
5) other methods (classification trees, graphics) can be used to search
for important predictors that GLMSELECT might miss;
6) GLMSELECT does not conduct a test of a hypothesis about a model, and
should not be interpreted as such;
7) exploratory methods do not depend as critically on distribution and
On Behalf Of Wensui Liu
Sent: Friday, March 21, 2008 2:35 PM
To: Peter Flom
Subject: Re: PROC GLMSELECT?
first of all, the idea of lasso is similar to ridge regression and is based
upon the shrinkage process of estimated coefficients by suppressing
unimportant ones to zero. as we all know, the link functions for OLS
reg and Logit reg are different. identity is a linear link, while logit
is not, meaning the ways to estimate coefficients are different in two
types of models. if coefficients from OLS reg are not comparable to those
from Logit reg, how would their shrinkage process be comparable and
On Fri, Mar 21, 2008 at 2:13 PM, Peter Flom
> Wensui Liu <firstname.lastname@example.org> wrote
> >well, peter,
> >the references listed are all referring to using lasso in generalized
> >linear models. however, they never mentioned anything about using proc
> >glmselect in generalized linear models. these are totally different
> >concepts. per my limited understanding about glmselect, the GLM here
> >means general linear models instead of generalized linear models.
> >please correct me if i am wrong.
> >thx.
> Well, not *totally* different concepts.
> You are right that GLMSELECT is designed to work with continuous DVs.
> But I don't see anything that STOPS it from working with binary DVs,
> or survival times.
> GLMSELECT, after all, is also not intended to be a final step in
> analysis: It's designed to be a variable selection tool, to be
> followed up with use of PROC GLM or PROC REG. I see no reason why it
> can't be used as a variable selection tool, only followed with PROC
> LOGISTIC or PROC PHREG; PROC MIXED and similar would be more problematic,
> I think. The reason I think this is because of what assumptions are
> violated by each technique.
> Binary and survival DVs violate assumptions about the distribution of
> the residuals. Mixed models violate assumptions about their
> independence. That seems to me to be a much more complex problem.
> Statistical Consultant
> www DOT peterflom DOT com
ChoicePoint Precision Marketing
Email : email@example.com
Blog : statcompute.spaces.live.com