LISTSERV at the University of Georgia
Date:         Sat, 3 Sep 2005 06:33:11 -0400
Reply-To:     Peter Flom <flom@NDRI.ORG>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Peter Flom <flom@NDRI.ORG>
Subject:      Re: On Hosmer-Lemeshow, etc. and Model Selection
Comments: To: davidlcassell@MSN.COM
Content-Type: text/plain; charset=US-ASCII

Talbot Michael Katz wrote

>Peter -- Not a guru? Howzabout we'll call you "Guru, Jr." for now. So,

And David Cassell replied <<< Peter tends to under-value his years of experience. >>>

Thank you both!

TMK:

>let's talk about categorizing continuous variables. While I nod my head in vigorous agreement with all the drawbacks you point out, I am still an enthusiastic advocate of discretization, and I'm surprised that you, as a logistic modeler, are not. I think of discretization as the logistic modeler's secret weapon (when used wisely, of course). What it comes down to is finding non-monotone response. Let's look at the example that you gave about age grouping and heart attacks. I don't know the data at all, but I'm guessing that the probability of having a heart attack pretty much increases with age, you know, wear and tear and all that. But, perhaps there are other forces at work... people with weak hearts will die earlier and stronger people will live longer, so their probability may decrease after a certain age. Suppose the percentage of heart attacks in the 45-54 population is 10%, and then it goes up to 20% in the 55-64 group and back down to 10% in the 65-74. Well, then, untransformed age might not show up as predictive for heart attacks in a logistic model, but put it into these bins, and bingo! Okay, you might contend that trees are better for non-monotone response capture, but even there you can increase your chances with well-placed binning (of course, you have to be very careful to avoid overfitting). This is something I've spent a lot of time thinking about and looking at, and I'm curious about your views (and anyone else's).
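
For readers who want to see the arithmetic behind that hypothetical, here is a minimal Python sketch (not SAS, and not from the original thread). It assumes three equal-sized age groups of 1000 people each, represented by bin midpoints of 50, 60 and 70, with the 10% / 20% / 10% rates from the example; the group sizes, the midpoints and the function name fit_linear_logit are all made up for illustration. It fits a logistic model with age as a single linear term and compares that with the odds ratio you get from the bins.

```python
import math

# Hypothetical grouped data built from the percentages in the quoted example:
# (age-bin midpoint, number of people, number with a heart attack).
# The group size of 1000 and the midpoints are assumptions for illustration.
groups = [(50.0, 1000, 100), (60.0, 1000, 200), (70.0, 1000, 100)]


def fit_linear_logit(data, iterations=25):
    """Fit logit(p) = b0 + b1 * (age - mean_age) by Newton-Raphson."""
    total = sum(n for _, n, _ in data)
    mean_age = sum(age * n for age, n, _ in data) / total
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for age, n, events in data:
            x = age - mean_age
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)
            g0 += events - n * p
            g1 += (events - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information) * step = score.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def odds(events, n):
    """Odds of the event in a group of n people with `events` cases."""
    return events / (n - events)


b0, b1 = fit_linear_logit(groups)
# Odds ratio of the 55-64 bin against either outer bin (both are 10%).
odds_ratio_middle = odds(200, 1000) / odds(100, 1000)

print("slope for linear age:", b1)
print("odds ratio, 55-64 vs 45-54:", odds_ratio_middle)
```

With these perfectly symmetric made-up numbers the fitted slope for linear age is zero, while the binned comparison gives an odds ratio of 2.25 for the middle group. Real data would not be this tidy, and the sketch says nothing about the costs of binning raised elsewhere in this thread.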

DC: <<< I'm going to side with Peter here.

You make a really good point. But I don't think binning is the right answer. Well, it may be the right answer some of the time, but not all the time. I think that [1] EDA (exploratory data analysis) and [2] appropriate selection of a meaningful data transform is the better answer.

Sometimes, non-linear effects totally muck up linear estimation approaches... because people aren't doing their homework first! Some EDA to start with should show the pattern you lay out above, which would then lead to a more meaningful model, as well as more relevant hypotheses for future studies.

And there's always PROC TRANSREG. Spline curves, anyone? Just kidding. While splines and other interesting transforms have been used to great effect, they may also *reduce* the interpretability of the resulting model. >>>

While I certainly agree that spline curves make interpretation difficult, might they not be good in Bora's case, where interpretability is less of an issue?

I brought up the heart attack example because 1) Bora has not said what his variables are, so I have trouble talking about them :-) (although Bora DID give an admirable amount of detail about what he was doing; thank you, Bora), and 2) it seemed like an example that would be readily understandable. But in the case of heart attacks, we presumably ARE interested in interpretability.

Even there, though, Frank Harrell, in his Regression Modeling Strategies, makes a case for using splines. I am not totally sold, but they seem worth a look. Me, I usually have to explain what I do to an audience of people who are experts in other fields (public health, epidemiology, AIDS, drug abuse) but not stats, and so I haven't used splines in my own work. (I usually have to understand something before I can explain it :-) :-)

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)

