LISTSERV at the University of Georgia
Date:         Wed, 25 Feb 2009 11:10:18 -0500
Reply-To:     Peter Flom <peterflomconsulting@mindspring.com>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject:      Re: oversampling too much?
Comments: To: Gary <fuguoyi@GMAIL.COM>
Content-Type: text/plain; charset=UTF-8

Gary <fuguoyi@GMAIL.COM> wrote:

>I am new to this group, and just started a job with a bank. When
>modeling rare events in marketing, it has been suggested by many to
>take a sample stratified by the dependent variable(s) in order to
>allow the modeling technique a better chance of detecting a
>difference. Many literature suggests the proportion of the event in
>the sample seems to range between 15-50% for a binary outcome, and we
>can use an offset to adjust it.
>
>The response rate of my current case is 0.3%, and when I build the
>model, I oversampled the response to 25%. However, the tradition here
>is to oversample to 1%, and they told me that if oversample too much,
>the model will be sensitive.
>
>Is there any problem oversample from 0.3% (8000 out of 2.2M targets)
>to 25% (8000 resps and 24000 non-resps). We have about 500 variables
>to build the model.
>

I am not an expert in this area, but 1) I don't see how oversampling from an existing data set helps. I could see oversampling when *building* a data set: you want to oversample rare populations so that you have enough people from those populations. But in your situation, I think the only advantage of oversampling would be the speed with which the logistic regression runs.

(That's just my intuition ....)
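For what it's worth, the offset adjustment Gary mentions can be sketched in a few lines. This is only an illustration, assuming all 8000 responders are kept and the non-responders are sampled down to 24000, as in his numbers; the function names are made up for the example and are not from any SAS procedure or library. The idea is that sampling on the outcome shifts the log odds by a constant, so subtracting that constant maps a probability scored on the 25% sample back to the population scale.

```python
import math

def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def sampling_offset(pop_rate, sample_rate):
    """Constant shift in log odds caused by sampling on the outcome."""
    return logit(sample_rate) - logit(pop_rate)

def corrected_probability(p_sample, pop_rate, sample_rate):
    """Map a probability from the oversampled scale to the population scale."""
    z = logit(p_sample) - sampling_offset(pop_rate, sample_rate)
    return 1 / (1 + math.exp(-z))

# Figures from the question: 8000 responders out of 2.2M targets,
# modeled on a sample of 8000 responders and 24000 non-responders.
pop_rate = 8000 / 2_200_000
sample_rate = 8000 / (8000 + 24000)

offset = sampling_offset(pop_rate, sample_rate)
p_population = corrected_probability(0.40, pop_rate, sample_rate)
```

A score equal to the sample response rate (25%) maps back to the population response rate, which is a quick sanity check on the adjustment. Only the intercept is affected; the ranking of prospects is unchanged.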

2) I am concerned about any model that has 500 variables, *regardless* of the number of cases. The 10-to-1 rule of thumb is not bad, but it's not ironclad. What are these 500 variables? How are they related?

3) Since you are in marketing, I imagine you are mainly or entirely interested in prediction, rather than explanation. You might consider multimodel averaging (see a book by Burnham and Anderson).
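One common form of multimodel averaging weights each candidate model by its AIC (Akaike weights) and averages the predictions. The sketch below assumes that is the flavor meant here, which is my reading rather than something stated above; the AIC values and predicted probabilities are hypothetical numbers chosen only to show the arithmetic.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: exp(-delta/2) for each model, normalized to sum to 1."""
    best = min(aic_values)
    rel_likelihood = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihood)
    return [r / total for r in rel_likelihood]

def average_prediction(predictions, weights):
    """Weighted average of the models' predictions for one case."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical AICs for three candidate logistic models, and their
# predicted response probabilities for a single prospect.
aics = [1000.0, 1002.0, 1010.0]
preds = [0.004, 0.006, 0.010]

weights = akaike_weights(aics)
p_avg = average_prediction(preds, weights)
```

The model with the lowest AIC gets the largest weight, and a model whose AIC is far above the best contributes almost nothing to the averaged prediction.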

HTH

Peter

Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com

