Date: Fri, 14 Jan 2005 13:28:16 -0500
Reply-To: "Nick ." <ni14@MAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Nick ." <ni14@MAIL.COM>
Subject: Statistical question--oversampled responders
Content-Type: text/plain; charset="iso-8859-1"
Hello Dear SAS experts,
I desperately need a solution to this statistical problem.
I am working with an outside vendor to get modeling data. The vendor gives the data to our company, a team there looks at it, and then they send it over to me for modeling. Today I found out about this horrible (?) thing that they had done. Here is the situation:
Someone has 7 million records of data: about 22,000 responders, and the rest are nonresponders. This translates to a 0.3% response rate. So here is what they did.
They took out ALL 22,000 responders from the 7 million records, and from the remaining nonresponder population they randomly selected 950,000 records. So they sent me a dataset of about 972,000 records with a response rate of about 2.3%. I built the model on that, and today I find out that they had done that to me!!!! Clearly, I cannot use the model based on the 972,000 records to score the 7 million records, due to the oversampling of the responders. As a side note, please keep in mind that I used a software package (KXEN) to build the model. I didn't use SAS and PROC LOGISTIC.
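To make the arithmetic concrete, here is a small Python sketch that recomputes the rates from the approximate counts quoted above. The variable names are only illustrative, and the counts are the rounded "about" figures from this post, not exact values.

```python
# Approximate counts quoted in the post (rounded figures, not exact values).
total_records = 7_000_000
responders = 22_000
sampled_nonresponders = 950_000

# Response rate in the full population (about 0.3%).
population_rate = responders / total_records

# Response rate in the dataset the vendor delivered (about 2.3%).
sample_size = responders + sampled_nonresponders
sample_rate = responders / sample_size

# Fraction of nonresponders that were kept; all responders were kept.
nonresponder_fraction = sampled_nonresponders / (total_records - responders)

print(f"population response rate: {population_rate:.2%}")
print(f"sample response rate:     {sample_rate:.2%}")
print(f"nonresponders kept:       {nonresponder_fraction:.2%}")
```

So responders were kept at 100% while only about one nonresponder in seven was kept, which is what inflates the response rate in the delivered data.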
My question is this: knowing the information above, and knowing that I don't use PROC LOGISTIC, how can I modify the left-hand side of the equation, i.e. the responders, to truly reflect the 0.3% response rate? I told them to go back and give me a RANDOM sample of 1 million records out of the 7 million (that would solve my problem), but they won't do that. They want me to build the model on the data they sent me and then somehow ***adjust*** for the oversampling of the responders.
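For reference, the kind of adjustment being asked about can be sketched as follows. This is the standard odds rescaling used when the two classes are sampled at different rates. It assumes the model output is a probability estimated on the oversampled data, which may not be true of a KXEN score. The function name and its arguments are hypothetical, and the counts are the approximate ones from this post.

```python
def adjust_for_oversampling(p_sample, nonresponder_fraction, responder_fraction=1.0):
    """Rescale a probability estimated on the oversampled data to the population.

    Assumption: p_sample is a probability from a model fit on the sample.
    Sampling the classes at different rates multiplies the odds by
    responder_fraction / nonresponder_fraction, so this undoes that factor.
    """
    numerator = p_sample * nonresponder_fraction
    denominator = numerator + (1.0 - p_sample) * responder_fraction
    return numerator / denominator


# Approximate sampling fraction from the post: 950,000 of about 6,978,000
# nonresponders were kept, while all responders were kept (fraction 1.0).
kept = 950_000 / (7_000_000 - 22_000)

# A record scored at the sample's average response rate should map back
# to the population's average response rate.
average_sample_score = 22_000 / 972_000
print(adjust_for_oversampling(average_sample_score, kept))
```

Because the rescaling is monotone, it does not change the rank order of the scored records, only the level of the predicted probabilities.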