LISTSERV at the University of Georgia
Date:   Fri, 25 Jul 2003 14:31:58 +1000
Reply-To:   paulandpen@optusnet.com.au
Sender:   "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:   Paul Dickson <paulandpen@optusnet.com.au>
Subject:   Re: What does "over-fitting" mean?
Comments:   To: Christina Cutshaw <ccutsha1@jhem.jhmi.edu>
Content-Type:   text/plain

Hi there Christina,

I agree with the comments made (and yes, they were a little harsh in my opinion, but unfortunately true for stats-oriented journals and their readers). Readers with enough stats knowledge could hammer you for using this type of analysis with that sample size, provided they have the time and could be bothered (I would not, but some others out there might). I would therefore recommend deferring to descriptives, revising your journal publication plans to aim for practitioner/non-stats-based journals, and going from there.

I would also look at previously published research in the same area that has used any of the same variables as you, and compare your findings on a variable-by-variable basis (you don't have the sample size to develop a multivariate model from your data). Use tables (means, medians, proportions, etc.) of your results alongside other people's to comment on the findings: similarities, differences, possible reasons why they differ or agree, and how this could contribute to the field as a whole. Recommendations for follow-up and future funding should also be factored in here. This broad information could help you provide useful clinical/theoretical information and make comparisons across studies that, while not statistically meaningful, are still incredibly meaningful to practitioners and to the theory in a broad sense. Also see if there are norms for your data, and look to national data (incidence/prevalence) that you could use to contextualise your findings a little more. A minimal sketch of the kind of side-by-side descriptive table I mean follows.
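
For what it's worth, here is a small Python/pandas sketch of such a table (illustrative only; the file name "study_data.csv" and the variable names "success", "age", "score", and "treated" are invented placeholders, not from the study under discussion):

    import pandas as pd

    # Hypothetical file name; substitute your own data set.
    df = pd.read_csv("study_data.csv")

    # Means, medians, and counts per variable, split by the binary
    # outcome, ready to set beside the corresponding figures from
    # previously published studies.
    summary = (
        df.groupby("success")[["age", "score"]]
          .agg(["mean", "median", "count"])
    )
    print(summary)

    # Proportions for a categorical variable (hypothetical "treated"
    # flag), as row percentages within each outcome group.
    print(pd.crosstab(df["success"], df["treated"], normalize="index"))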

As an aside, I am convinced there is plenty of crap published out there that makes little contribution to any real theory or clinical/practical meaning (no effect size reported), simply because the sample size used in the analysis was so big that even tiny differences were picked up statistically (if only we all had data sets like this). All significant stats really do, at the end of the day, is substantiate the analysis statistically (according to commonly agreed and valid rules of thumb, which vary so much that sometimes they are confusing); they say nothing broader about the meaning of the results from a theoretical or clinical point of view. That comment may also be a bit harsh, but I wonder how "true" it is!
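
To see that point numerically, here is a quick simulation (illustrative only, with invented numbers; Python with numpy/scipy) in which an enormous sample makes a trivially small group difference come out highly "significant" while the effect size stays negligible:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 1_000_000                     # enormous sample per group
    a = rng.normal(0.00, 1.0, n)      # group A
    b = rng.normal(0.01, 1.0, n)      # group B: mean shifted by 0.01 SD

    res = stats.ttest_ind(a, b)
    # Pooled-SD standardised difference (Cohen's d).
    d = (b.mean() - a.mean()) / np.sqrt((a.var() + b.var()) / 2)
    print(f"p = {res.pvalue:.2g}, Cohen's d = {d:.3f}")

    # Typical output: p is tiny (highly "significant") while d is about
    # 0.01, far below any conventional threshold for practical importance.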

Finally, I hope one day we find ways to model our important groupings of variables on very, very small sample sizes and still produce meaningful results!

Cheers Paul

> Christina Cutshaw <ccutsha1@jhem.jhmi.edu> wrote:
>
> > Steve,
> >
> > Thank you for your comments. I will evaluate my options in light
> > of your suggestions.
> >
> > Best,
> >
> > Chris
> >
> > "Simon, Steve, PhD" wrote:
> >
> > > Chris Cutshaw writes:
> > >
> > > > I am conducting binary logistic regression analyses with a
> > > > sample size of 73, of which 22 have the outcome of interest
> > > > (e.g. are "very successful" versus somewhat/not very
> > > > successful). I have fourteen variables of interest which I
> > > > examined in a univariate logistic regression with the
> > > > dependent variable. Of these fourteen, six have a
> > > > likelihood-ratio chi-square of p<0.25. Hosmer & Lemeshow
> > > > suggest that all variables with a p<0.25 be examined in the
> > > > multivariable modeling. I have heard that there should be
> > > > about 10 cases with the outcome of interest per independent
> > > > variable to avoid "overfitting."
> > > >
> > > > 1) Does this mean my final model should contain no more than
> > > > 2 variables? 2) Can I look at all six variables using a
> > > > forward stepwise procedure, for example, as long as the final
> > > > model has only two or three variables? Or should I create
> > > > several different two- or three-variable models and see which
> > > > combinations yield significant results and compare them in
> > > > some way?
> > > >
> > > > What does "overfitting" actually mean?
> > >
> > > I apologize if some of the comments here appear harsh. You are
> > > going to have to seriously lower your expectations. That may be
> > > disheartening, but better to face the bad news now rather than
> > > later.
> > >
> > > Overfitting means that some of the relationships that appear
> > > statistically significant are actually just noise. You will
> > > find that a model with overfitting does not replicate well and
> > > does a lousy job of predicting future responses.
> > >
> > > The rule of 10 observations per variable (I've also heard 15)
> > > refers to the number of variables screened, not the number in
> > > the final model. Since you looked at 14 variables, you really
> > > needed 140 to 210 events of interest (equivalent to 464 to 697
> > > total observations) to be sure that your model is not
> > > overfitting the data.
> > >
> > > What to do, what to do?
> > >
> > > If you are trying to publish these results, you have to hope
> > > that the reviewers are all asleep at the switch. Instead of a
> > > ratio of 10 or 15 to one, your ratio is 1.6 to one. All 14
> > > variables are part of the initial screen, so you can't say that
> > > you only looked at six variables.
> > >
> > > Of course, you were unfortunate enough to have the IRB asleep
> > > at the switch, because they should never have approved such an
> > > ambitious data analysis on such a skimpy data set. So maybe the
> > > reviewers will be the same way.
> > >
> > > I wouldn't count on it, though. If you want to improve your
> > > chances of publishing the results, there are several things you
> > > can do.
> > >
> > > First, I realize that the answer is almost always "NO" but I
> > > still have to ask--is there any possibility that you could
> > > collect more data? In theory, collecting more data after the
> > > study has ended is a protocol deviation (be sure to tell your
> > > IRB). And there is some possibility of temporal trends that
> > > might interfere with your logistic model. But both of these
> > > "sins" are less serious than overfitting your data.
> > >
> > > Second, you could slap the "exploratory" label on your
> > > research. Put in a lot of qualifiers like "Although these
> > > results are intriguing, the small sample size means that these
> > > results may not replicate well with a larger data set." This is
> > > a cop-out in my opinion. I've fallen back on this when I've
> > > seen ratios of four to one or three to one, but you don't even
> > > come close to those ratios.
> > >
> > > Third, ask a colleague who has not looked at the data to help.
> > > Show him/her the list of 14 independent variables and ask which
> > > two should be the highest priority, based on biological
> > > mechanisms, knowledge of previous research, intuition, etc.,
> > > BUT NOT LOOKING AT THE EXISTING DATA. Then do a serious
> > > logistic regression model with those two variables, and treat
> > > the other twelve variables in a purely exploratory mode.
> > >
> > > Fourth, admit to yourself that you are trying to squeeze blood
> > > from a turnip. A sample of 73 with only 22 events of interest
> > > is just not big enough to allow for a decent multivariable
> > > logistic regression model. You can't look for the effect of A,
> > > adjusted for B, C, and D, so don't even try. Report each
> > > individual univariate logistic regression model and leave it at
> > > that.
> > >
> > > Fifth (and most radical of all), give up all thoughts of
> > > logistic regression and p-values altogether. Who made a rule
> > > that says that every research publication has to have p-values?
> > > Submit a publication with a graphical summary of your data.
> > > Boxplots and/or bar charts would work very nicely here. Explain
> > > that your data set is too small to entertain any serious
> > > logistic regression models. If you're unlucky, the reviewers
> > > may ask you to put in some p-values anyway. Then you could
> > > switch to the previous option.
> > >
> > > Sixth, there are some newer approaches to statistical modeling
> > > that are less prone to overfitting. Perhaps the one you are
> > > most likely to see is CART (Classification and Regression
> > > Trees). These models can't make a silk purse out of a sow's
> > > ear, but they do have some cross-validation checks that make
> > > them slightly better than stepwise approaches.
> > >
> > > If you asked people on this list how many of them have
> > > published results when they knew that the sample size was way
> > > too small, almost every hand would go up, I suspect. I've done
> > > it more times than I want to admit. Just be sure to scale back
> > > your expectations, limit the complexity of any models, and be
> > > honest about the limitations of your sample size.
> > >
> > > Good luck!
> > >
> > > Steve Simon, ssimon@cmh.edu, Standard Disclaimer.
> > > The STATS web page has moved to
> > > http://www.childrens-mercy.org/stats.
> > >
> > > P.S. I've adapted this question for one of my web pages. Take a
> > > look at
> > > http://www.childrens-mercy.org/stats/model/overfit.asp
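
To make the arithmetic behind Steve's 464-697 figure explicit: 14 screened variables at 10-15 events each is 140-210 events, and at the observed event rate of 22/73 that works out to roughly 140/(22/73) = 464 to 210/(22/73) = 697 total observations. The Python sketch below (an illustration only, assuming numpy and statsmodels are available; all data are simulated noise) shows what overfitting looks like at the thread's dimensions (n = 73, 22 events, 14 candidate predictors): the univariate p < .25 screen usually passes several pure-noise predictors, and a model refit to the survivors can show apparently "significant" coefficients that are nothing but noise.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, n_events, n_vars = 73, 22, 14

    y = np.zeros(n)
    y[:n_events] = 1                  # 22 events of interest, 51 non-events
    X = rng.normal(size=(n, n_vars))  # 14 predictors, all pure noise

    # Univariate screen at p < .25, following Hosmer & Lemeshow's rule.
    kept = [
        j for j in range(n_vars)
        if sm.Logit(y, sm.add_constant(X[:, j])).fit(disp=0).pvalues[1] < 0.25
    ]
    print("noise variables passing the p < .25 screen:", kept)

    # Refit on the survivors; with about 1.6 events per screened
    # variable, some coefficients will often look "significant"
    # purely by chance -- which is exactly the overfitting problem.
    if kept:
        fit = sm.Logit(y, sm.add_constant(X[:, kept])).fit(disp=0)
        print(fit.pvalues[1:].round(3))

Rerunning with different seeds gives different "survivors" each time, which is another way of seeing that the screened-in relationships would not replicate.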

