Date: Sat, 3 Sep 2005 06:33:11 -0400
Reply-To: Peter Flom <flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <flom@NDRI.ORG>
Subject: Re: On Hosmer-Lemeshow, etc. and Model Selection
Talbot Michael Katz wrote:
>Peter -- Not a guru? Howzabout we'll call you "Guru, Jr." for now.
And David Cassell replied:
Peter tends to under-value his years of experience.
Thank you both!
>let's talk about categorizing continuous variables. While I nod my head
>in vigorous agreement with all the drawbacks you point out, I am still an
>enthusiastic advocate of discretization, and I'm surprised that you, as a
>logistic modeler, are not. I think of discretization as the logistic
>modeler's secret weapon (when used wisely, of course). What it comes down
>to is finding non-monotone response. Let's look at the example that you
>gave about age grouping and heart attacks. I don't know the data at all,
>but I'm guessing that the probability of having a heart attack pretty much
>increases with age, you know, wear and tear and all that. But, perhaps
>there are other forces at work... people with weak hearts will die
>and stronger people will live longer, so their probability may decrease
>after a certain age. Suppose the percentage of heart attacks in the
>population is 10%, and then it goes up to 20% in the 55-64 group and
>down to 10% in the 65-74. Well, then, untransformed age might not show up
>as predictive for heart attacks in a logistic model, but put it into
>bins, and bingo! Okay, you might contend that trees are better for
>monotone response capture, but even there you can increase your chances
>with well-placed binning (of course, you have to be very careful to avoid
>overfitting). This is something I've spent a lot of time thinking about
>and looking at, and I'm curious about your views (and anyone else's).
I'm going to side with Peter here.
You make a really good point. But I don't think binning is the right
answer. Well, it may be the right answer some of the time, but not
all the time. I think that EDA (exploratory data analysis) and
appropriate selection of a meaningful data transform is the better
approach. Sometimes, non-linear effects totally muck up linear
estimation approaches... because people aren't doing their homework first!
Some EDA to start with should show the pattern you lay out above,
which would then lead to a more meaningful model, as well as
more relevant hypotheses for future studies.
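The non-monotone pattern described above can be checked with a few lines
of arithmetic. The sketch below is Python rather than SAS, and the data
are made up: three equal-sized age groups with heart attack rates of 10%,
20% and 10%. The age midpoints (50, 60, 70) and the group size of 1000 are
assumptions for illustration only. It fits logit(p) = b0 + b1*age by
Newton-Raphson and compares that fit with the per-group rates that a
simple EDA tabulation (or binning) would show.

```python
import math

# Hypothetical grouped data: (age midpoint, people at risk, heart attacks).
# Rates are 10%, 20%, 10%; group sizes and midpoints are assumed.
groups = [(50, 1000, 100), (60, 1000, 200), (70, 1000, 100)]


def fit_linear_logit(data, iters=25):
    """Fit logit(p) = b0 + b1*(x - mean_x) to grouped binomial data."""
    total_n = sum(n for _, n, _ in data)
    mean_x = sum(x * n for x, n, _ in data) / total_n
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in data:
            z = x - mean_x                      # centered age
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * z)))
            w = n * p * (1.0 - p)
            g0 += y - n * p                     # score for the intercept
            g1 += z * (y - n * p)               # score for the slope
            h00 += w
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det       # Newton-Raphson update
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, mean_x


def log_lik(data, probs):
    """Binomial log-likelihood, dropping the constant term."""
    return sum(y * math.log(p) + (n - y) * math.log(1.0 - p)
               for (_, n, y), p in zip(data, probs))


b0, b1, mean_x = fit_linear_logit(groups)
linear_probs = [1.0 / (1.0 + math.exp(-(b0 + b1 * (x - mean_x))))
                for x, _, _ in groups]
binned_probs = [y / n for _, n, y in groups]    # what EDA or bins would show

# Likelihood-ratio statistic: one parameter per bin vs. linear in age (1 df).
lr_stat = 2.0 * (log_lik(groups, binned_probs) - log_lik(groups, linear_probs))

print("slope on age:", b1)
print("fitted, linear in age:", [round(p, 3) for p in linear_probs])
print("observed by group:    ", binned_probs)
print("LR statistic, bins vs. linear:", round(lr_stat, 1))
```

Because the made-up rates are symmetric around the middle group, the
fitted slope is zero and the linear model predicts the same probability
(400/3000) in every group, while the per-group rates recover the hump.
Real data would rarely be this tidy, so treat this as an illustration of
the argument, not as evidence for either side of it.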
And there's always PROC TRANSREG. Spline curves, anyone?
Just kidding. While splines and other interesting transforms have
been used to great effect, they may also *reduce* the
interpretability of the resulting model.
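To make the interpretability trade-off concrete, here is a rough Python
sketch (not SAS, and not a description of what PROC TRANSREG does
internally) of a restricted cubic spline basis, using the truncated-power
form commonly given for such splines. The knots at ages 45, 55, 65 and 75
and the function name rcs_basis are hypothetical choices for illustration.

```python
def rcs_basis(x, knots):
    """Return [x, s_1(x), ..., s_{k-2}(x)] for a restricted cubic spline.

    The nonlinear columns are zero below the first knot and linear above
    the last knot, so the fitted curve is linear in both tails.
    """
    def pos3(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    t_last, t_prev = knots[-1], knots[-2]
    scale = (knots[-1] - knots[0]) ** 2     # keeps columns on a modest scale
    cols = [float(x)]
    for t_j in knots[:-2]:
        term = (pos3(x - t_j)
                - pos3(x - t_prev) * (t_last - t_j) / (t_last - t_prev)
                + pos3(x - t_last) * (t_prev - t_j) / (t_last - t_prev))
        cols.append(term / scale)
    return cols


knots = [45, 55, 65, 75]    # hypothetical knot locations (ages)
for age in (40, 50, 60, 70, 80):
    print(age, [round(v, 3) for v in rcs_basis(age, knots)])
```

A logistic model would then get one coefficient per column. The curve can
bend between the knots, but no single coefficient reads as "the odds ratio
per year of age", which is the interpretability cost mentioned above.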
While I certainly agree that spline curves make interpretation difficult,
might they not be good in Bora's case, where interpretability is less an
issue?
I brought up the heart attack example because 1) Bora has not said what
his variables are, so I have trouble talking about them :-) (although
Bora DID give an admirable amount of detail as to what he was doing,
thank you Bora), and 2) it seemed like an example that would be readily
understandable. But in the case of heart attacks, we presumably ARE
interested in interpretability.
Even there, though, Frank Harrell, in his Regression Modeling
Strategies, makes a case for using splines... I am not totally sold, but
they seem worth a look. Me, I usually have to explain what I do to an
audience of people who are experts in other fields (public health,
epidemiology, AIDS, drug abuse) but not stats, and so haven't used them
in my own work. (I usually have to understand something before I can
explain it :-) :-)
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)