Date: Wed, 25 Feb 2009 11:10:18 -0500
Reply-To: Peter Flom <peterflomconsulting@mindspring.com>
Sender: "SAS(r) Discussion" <SASL@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: oversampling too much?
Content-Type: text/plain; charset=UTF-8
Gary <fuguoyi@GMAIL.COM> wrote:
>I am new to this group, and just started a job with a bank. When
>modeling rare events in marketing, it has been suggested by many to
>take a sample stratified by the dependent variable(s) in order to
>allow the modeling technique a better chance of detecting a
>difference. Much of the literature suggests the proportion of the event in
>the sample should range between 15-50% for a binary outcome, and we
>can use an offset to adjust it.
>
>The response rate in my current case is 0.3%, and when I built the
>model, I oversampled the response to 25%. However, the tradition here
>is to oversample to 1%, and they told me that if I oversample too much,
>the model will be sensitive.
>
>Is there any problem oversampling from 0.3% (8,000 out of 2.2M targets)
>to 25% (8,000 responders and 24,000 non-responders)? We have about 500 variables
>to build the model.
>
I am not an expert in this area, but:
1) I don't see how oversampling from an existing data set helps. I could see
oversampling when *building* a data set: you want to oversample rare populations so that
you have enough people from those populations. But in your situation, I think the
only advantage of oversampling would be the speed with which the logistic regression runs.
(That's just my intuition ....)
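As an aside on the offset adjustment mentioned in the original post: here is a rough sketch of the usual prior-correction idea (subtract a constant from the intercept so scores from the oversampled model map back to population-scale probabilities). The rates are the ones from Gary's post; the function name is just illustrative:

```python
# Sketch: correcting logistic-regression scores after oversampling responders.
# Assumes the true population response rate (tau) and the response rate in
# the oversampled modeling data (rho) are both known.
import math

tau = 0.003   # true response rate in the population (0.3%)
rho = 0.25    # response rate in the oversampled modeling data (25%)

# Prior-correction term: subtract this from the fitted intercept
# (equivalently, include it as an offset) so predicted probabilities
# are on the population scale again.
correction = math.log((rho / (1 - rho)) * ((1 - tau) / tau))

def population_prob(linear_predictor):
    """Map a score from the oversampled model to a population-scale probability."""
    adjusted = linear_predictor - correction
    return 1.0 / (1.0 + math.exp(-adjusted))
```

Note that this correction only shifts the intercept; the slope estimates from the oversampled fit are left alone.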
2) I am concerned about any model that has 500 variables, *regardless* of the number of cases.
The 10:1 rule of thumb is not bad, but it's not ironclad. What are these 500 variables? How are they
related?
3) Since you are in marketing, I imagine you are mainly or entirely interested in prediction, rather than explanation. You might consider multimodel averaging (see the book by Burnham and Anderson).
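To illustrate what multimodel averaging looks like in the Burnham-and-Anderson spirit: each candidate model gets an Akaike weight from its AIC, and predictions are averaged with those weights. The model names, AIC values, and predicted probabilities below are made up purely for illustration:

```python
# Sketch: model averaging with Akaike weights (hypothetical numbers).
import math

# Hypothetical candidate models, each with its AIC and its predicted
# response probability for one new case.
models = {
    "small":  {"aic": 1012.4, "pred": 0.21},
    "medium": {"aic": 1008.1, "pred": 0.27},
    "large":  {"aic": 1009.6, "pred": 0.24},
}

best_aic = min(m["aic"] for m in models.values())

# Akaike weight: exp(-delta_i / 2), normalized so the weights sum to 1,
# where delta_i is each model's AIC minus the best (smallest) AIC.
raw = {name: math.exp(-(m["aic"] - best_aic) / 2) for name, m in models.items()}
total = sum(raw.values())
weights = {name: w / total for name, w in raw.items()}

# Model-averaged prediction: each model's prediction weighted by its
# Akaike weight, so no single model has to be "the" model.
averaged = sum(weights[name] * models[name]["pred"] for name in models)
```

The appeal for prediction is that you hedge across several plausible models instead of betting everything on one variable-selection path.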
HTH
Peter
Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com
