Date: Sat, 3 Sep 2005 06:33:11 -0400
Reply-To: Peter Flom <flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <flom@NDRI.ORG>
Subject: Re: On Hosmer-Lemeshow, etc. and Model Selection
Content-Type: text/plain; charset=US-ASCII
Talbot Michael Katz wrote:
>Peter -- Not a guru? Howzabout we'll call you "Guru, Jr." for now. So,
And David Cassell replied
<<<
Peter tends to under-value his years of experience.
>>>
Thank you both!
TMK:
>let's talk about categorizing continuous variables. While I nod my head
>in vigorous agreement with all the drawbacks you point out, I am still an
>enthusiastic advocate of discretization, and I'm surprised that you, as a
>logistic modeler, are not. I think of discretization as the logistic
>modeler's secret weapon (when used wisely, of course). What it comes down
>to is finding non-monotone response. Let's look at the example that you
>gave about age grouping and heart attacks. I don't know the data at all,
>but I'm guessing that the probability of having a heart attack pretty much
>increases with age, you know, wear and tear and all that. But, perhaps
>there are other forces at work... people with weak hearts will die earlier
>and stronger people will live longer, so their probability may decrease
>after a certain age. Suppose the percentage of heart attacks in the 45-54
>population is 10%, and then it goes up to 20% in the 55-64 group and back
>down to 10% in the 65-74. Well, then, untransformed age might not show up
>as predictive for heart attacks in a logistic model, but put it into these
>bins, and bingo! Okay, you might contend that trees are better for
>non-monotone response capture, but even there you can increase your chances
>with well-placed binning (of course, you have to be very careful to avoid
>overfitting). This is something I've spent a lot of time thinking about
>and looking at, and I'm curious about your views (and anyone else's).
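To put rough numbers on the hypothetical above, here is a minimal Python sketch (not SAS, and not anyone's real data). It assumes 1,000 people in each age band, which the quoted example does not specify, and uses the band midpoints 50, 60 and 70 as the age values. It fits a logistic model with age entered linearly, using a hand-rolled Newton-Raphson, and compares it with the binned fit through a likelihood-ratio statistic.

```python
import math

# Hypothetical grouped data: (age-band midpoint, people, heart attacks).
# The 10% / 20% / 10% rates come from the quoted example; the group
# size of 1000 per band is an assumption made only for illustration.
groups = [(50, 1000, 100), (60, 1000, 200), (70, 1000, 100)]

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def loglik(probs):
    """Binomial log-likelihood (constants dropped) for per-band probabilities."""
    return sum(y * math.log(p) + (n - y) * math.log(1.0 - p)
               for (_, n, y), p in zip(groups, probs))

def fit_linear_age(iters=25):
    """Fit logit(p) = b0 + b1 * (age - 60) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for age, n, y in groups:
            x = age - 60                 # centered age
            p = expit(b0 + b1 * x)
            w = n * p * (1.0 - p)        # binomial variance weight
            g0 += y - n * p              # score for the intercept
            g1 += (y - n * p) * x        # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 information system directly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_linear_age()
ll_linear = loglik([expit(b0 + b1 * (age - 60)) for age, _, _ in groups])

# Binned model: one parameter per band, so fitted rates equal observed rates.
ll_binned = loglik([y / n for _, n, y in groups])

# Likelihood-ratio statistic, binned vs. linear (1 extra parameter).
lr = 2.0 * (ll_binned - ll_linear)

print("slope on linear age:", b1)
print("LR statistic, binned vs. linear:", round(lr, 1))
```

Because the made-up rates are symmetric around the middle band, the fitted slope on linear age comes out at zero, which is the extreme version of "might not show up as predictive"; the binned fit recovers the 10/20/10 pattern. With only three distinct age values, a quadratic term in age would reproduce the same fit, so this sketch does not by itself favor bins over a transform.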
DC:
<<<
I'm going to side with Peter here.
You make a really good point. But I don't think binning is the right
answer. Well, it may be the right answer some of the time, but not
all the time. I think that [1] EDA (exploratory data analysis) and [2]
appropriate selection of a meaningful data transform are the better
answer.
Sometimes, non-linear effects totally muck up linear estimation
approaches... because people aren't doing their homework first!
Some EDA to start with should show the pattern you lay out above,
which would then lead to a more meaningful model, as well as
more relevant hypotheses for future studies.
And there's always PROC TRANSREG. Spline curves, anyone?
Just kidding. While splines and other interesting transforms have
been used to great effect, they may also *reduce* the
interpretability of the resulting model.
>>>
While I certainly agree that spline curves make interpretation difficult,
might they not be good in Bora's case, where interpretability is less of
an issue?
I brought up the heart attack example because 1) Bora has not said what
his variables are, so I have trouble talking about them :-) (although
Bora DID give an admirable amount of detail as to what he was doing,
thank you Bora), and 2) it seemed like an example that would be readily
understandable. But in the case of heart attacks, we presumably ARE
interested in interpretability.
Even there, though, Frank Harrell, in his Regression Modeling
Strategies, makes a case for using splines... I am not totally sold, but
they seem worth a look. Me, I usually have to explain what I do to an
audience of people who are experts in other fields (public health,
epidemiology, AIDS, drug abuse) but not stats, and so haven't used
them in my own work. (I usually have to understand something before
I can explain it :-) :-) )
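For anyone wondering what such a spline looks like in practice, here is a minimal Python sketch of a restricted cubic spline basis, the kind of spline associated with Harrell's book. The knot locations (45, 55, 65, 75) are made up for the age example, the function name rcs_basis is hypothetical, and dividing by the squared distance between the outer knots is just one scaling convention assumed here.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1, ..., s_{k-2}] for k knots. The fitted curve is
    cubic between knots and constrained to be linear beyond the
    outermost knots, which keeps the tails from flailing about.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # truncated cubic: (u)+ ** 3
        return max(u, 0.0) ** 3

    last_gap = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2      # assumed scaling convention
    cols = [float(x)]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / last_gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / last_gap)
        cols.append(term / scale)
    return cols

# Hypothetical knots for an age variable
knots = [45, 55, 65, 75]
for age in (40, 50, 60, 70, 80):
    print(age, [round(v, 3) for v in rcs_basis(age, knots)])
```

Each row of output is a set of columns that could be fed to a logistic regression in place of raw age. With four knots that is three columns (age plus two nonlinear terms), which hints at why the individual coefficients are harder to explain to a non-statistical audience than a single slope.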
Peter
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)