Date: Thu, 29 Jun 2006 10:29:51 -0700
Reply-To: Daqing Zhao <dlouiszhao@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Daqing Zhao <dlouiszhao@GMAIL.COM>
Subject: Re: Variable or model selection methods
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Thanks for the message.
I take it that there are cases where a domain expert or content specialist
knows what the important drivers are. If you are trying to predict the
trajectory of some planets, you don't want to include factors other than the
force field, time and initial conditions.
There are also cases where you don't know what drives the target variable,
such as the stock price of some company or the cause of some cancer. You try
to find predictors, and that's part of the game.
I know someone who was fixated on the Markov blanket, which to me is a
definition rather than a recipe.
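
To illustrate the definition-versus-recipe point, here is a minimal Python
sketch. It assumes the graph (a DAG) is already known, which is exactly what
you don't have in practice; the toy graph and the names `parents` and
`markov_blanket` are made up for illustration. Given the graph, the blanket of
a node is just its parents, its children, and the other parents of its
children.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a DAG.

    `parents` maps every node to the set of its direct parents.
    The blanket is: parents + children + other parents of the children.
    """
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket


# Hypothetical toy graph: a -> target <- b, target -> c <- d, c -> e, f isolated
parents = {
    "a": set(),
    "b": set(),
    "d": set(),
    "f": set(),
    "target": {"a", "b"},
    "c": {"target", "d"},
    "e": {"c"},
}

print(sorted(markov_blanket(parents, "target")))  # ['a', 'b', 'c', 'd']
```

Note that `e` and `f` drop out once the graph is given, but nothing here tells
you how to get the graph from data. That is the part that needs a recipe.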
On 6/28/06, Sigurd Hermansen <HERMANS1@westat.com> wrote:
> To paraphrase our resident scourge of all things stepwise, 'All stepwise
> methods are wrong. Other automatic model selection methods are wrong,
> too, but not as bad as stepwise'.
> No, that's not a paraphrase. It's a summary.
> Why does he disparage step-wise methods? Perhaps he believes that
> content specialists should know more than a computer chip about what
> determines what. He may also know that models selected stepwise don't
> hold up well when applied to samples other than the samples used to
> estimate them.
> Given the current state of the art, I prefer stochastic gradient
> boosting (say, TreeNet) as an exploratory tool, though content
> specialists should agree on almost all predictors in a model. The LASSO
> and other regularization methods may help deal with collinear
> predictors, but outliers and leverage points in a sample may still leave
> you with a bad model. For now the SAS-L Archives have enough postings on
> model selection to keep you occupied for the next few months....
> -----Original Message-----
> From: firstname.lastname@example.org [mailto:email@example.com]
> On Behalf Of Daqing Zhao
> Sent: Wednesday, June 28, 2006 2:29 PM
> To: SAS-L@listserv.uga.edu
> Subject: Variable or model selection methods
> Hi All,
> I often need to select a limited number of important variables from a
> large set for prediction and often wonder what the best methodology is.
> Of course different people say different ones are the best. I have all
> kinds of variables, categorical, binary, numeric, ordinal and many of
> them correlated and sparse.
> Can someone recommend a good method for doing that?
> Some say proc logistic stepwise is bad. How about CART gini index
> reduction, Lasso, leave one variable out, and there are also mutual
> information, Markov blanket, and others? Comments on accuracy,
> robustness (for type of variables, missing data, outliers, etc), and
> efficiency (need googleplex)?