Date: Wed, 25 Feb 2009 11:10:18 -0500
Reply-To: Peter Flom <firstname.lastname@example.org>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: oversampling too much?
Content-Type: text/plain; charset=UTF-8
Gary <fuguoyi@GMAIL.COM> wrote:
>I am new to this group, and just started a job with a bank. When
>modeling rare events in marketing, it has been suggested by many to
>take a sample stratified by the dependent variable(s) in order to
>give the modeling technique a better chance of detecting a
>difference. Much of the literature suggests that the proportion of the
>event in the sample should range between 15-50% for a binary outcome,
>and we can use an offset to adjust for it.
>The response rate in my current case is 0.3%, and when I built the
>model, I oversampled the response to 25%. However, the tradition here
>is to oversample to 1%, and they told me that if we oversample too
>much, the model will be sensitive.
>Is there any problem oversampling from 0.3% (8,000 out of 2.2M targets)
>to 25% (8,000 responders and 24,000 non-responders)? We have about 500
>variables to build the model.
I am not an expert in this area, but:
1) I don't see how oversampling from an existing data set helps. I could see
oversampling when *building* a data set: you want to oversample rare populations so that
you have enough people from those populations. But in your situation, I think the
only advantage of oversampling would be the speed with which the logistic regression runs.
(That's just my intuition ....)
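For what it's worth, the offset adjustment you mention usually looks something like
this in SAS. This is only a sketch: the data set names, response variable, and
predictors (sample, sample2, resp, x1-x3) are placeholders I made up, and the rates
are your 25% sample and 0.3% population figures.

   /* Constant offset for oversampling: the log-odds shift between the */
   /* sample rate (rho = 0.25) and the population rate (pi = 0.003),   */
   /* i.e. log( rho*(1-pi) / (pi*(1-rho)) ), about 4.71 here.          */
   data sample2;
      set sample;
      off = log( (0.25 * (1 - 0.003)) / (0.003 * (1 - 0.25)) );
   run;

   proc logistic data=sample2 descending;
      /* with the offset in the model, the intercept and coefficients  */
      /* are on the population (0.3%) scale; set off = 0 when scoring  */
      model resp = x1-x3 / offset=off;
   run;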
2) I am concerned about any model that has 500 variables, *regardless* of the number of cases.
The 10:1 events-per-variable rule of thumb is not bad, but it's not ironclad; with 8,000
responders you have 16 events per variable, which clears the rule, yet 500 variables is
still a lot to interpret or validate. What are these 500 variables? How were they chosen?
3) Since you are in marketing, I imagine you are mainly or entirely interested in
prediction rather than explanation. You might consider multimodel averaging (see
Burnham and Anderson's book, Model Selection and Multimodel Inference).
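A bare-bones sketch of the idea, under a made-up setup of my own (two small candidate
models on the same sample2 data set from above; in practice you would have more): fit
the candidates, turn their AIC differences into Akaike weights, and average the
predicted probabilities.

   /* Fit two hypothetical candidate models, keeping predictions and AICs */
   proc logistic data=sample2 descending;
      model resp = x1 x2 / offset=off;
      output out=pred1 p=p1;
      ods output FitStatistics=fit1;
   run;

   proc logistic data=sample2 descending;
      model resp = x1 x3 / offset=off;
      output out=pred2 p=p2;
      ods output FitStatistics=fit2;
   run;

   /* Pull each model's AIC into a macro variable */
   data _null_;
      set fit1(where=(Criterion='AIC'));
      call symputx('aic1', InterceptAndCovariates);
   run;
   data _null_;
      set fit2(where=(Criterion='AIC'));
      call symputx('aic2', InterceptAndCovariates);
   run;

   /* Akaike weights: w_i proportional to exp(-(AIC_i - AICmin)/2) */
   data avg;
      merge pred1(keep=p1) pred2(keep=p2);
      aicmin = min(&aic1, &aic2);
      w1 = exp(-(&aic1 - aicmin)/2);
      w2 = exp(-(&aic2 - aicmin)/2);
      p_avg = (w1*p1 + w2*p2) / (w1 + w2);   /* model-averaged prediction */
   run;

For a mailing campaign, p_avg is what you would rank prospects on.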
Peter L. Flom, PhD
www DOT peterflomconsulting DOT com