LISTSERV at the University of Georgia
Date:         Mon, 14 Nov 2005 11:16:10 -0500
Reply-To:     "Nick ." <ni14@MAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Nick ." <ni14@MAIL.COM>
Subject:      Re: Compute Predicted Probabilities of Oversampled Dependent
              Variable
Content-Type: text/plain; charset="iso-8859-1"

David C. responded to the following question:

> I am working with PROC LOGISTIC and my dependent variable is RESPONSE=1/0
> (1=RESPONSE to campaign, 0=no RESPONSE to campaign).
> I have a model.
> I use the following SAS statements to give me the predicted (or fitted)
> probabilities.
> As an example:
> PROC LOGISTIC data=MYDATA descending noprint;
> model RESPONSE=VAR1 VAR2 COUNTER;
> output out=probs predicted=phat;
> run;
>
> I assume that SAS (I'm using V8.2) has done all the necessary math to give
> me the correct phat. I further assume that phat represents the probability
> that each observation will take on the value 0 (I hope to see low values
> close to 0 in this case) or 1 (I hope to see high values close to 1 in
> this case). If I plot phat, hopefully I will see something resembling an
> S curve bounded by 0 and 1. Am I correct so far?
>
> Assuming I am correct so far, here is the twist:
>
> The response rate in my 1 million or so data set is very low, like 0.30%.
>
> EXAMPLE OF OVERSAMPLING THE RESPONSE (dependent variable):
> My data set contains 1 million records with dependent variable
> RESPONSE=0/1.
> I will use 5% as the response rate to make the numbers easy.
> This yields about 50K 1s and 950K 0s (a 5-95 split).
> I also have 500 inputs, VAR1-VAR500 let's say.
> I do not have time to be staring at the computer screen.
> I randomly select 10% of the 50K records to give me 5,000 1s.
> I randomly take 0.5% of the 950K records to give me 4,750 0s.
> I now have about 10K records with a roughly even (50-50) split to build
> the model.
> I won't stare at the computer screen for long now.
> I build the model.
> I run the SAS code above to give me the phat probabilities.
> But these phat probabilities aren't quite correct now due to the
> over-sampling. Right?
> These probabilities are off due to the intercept being incorrect.
> The right intercept should be b0-log(0.10/0.005)
>
> QUESTION: How do I get SAS to give me the correct adjusted probabilities?
> Do I even need to adjust these probabilities, since the intercept doesn't
> matter sometimes?
> I will use this model to assign propensity to respond to a campaign offer.
> I must have the (statistically) correct propensity (probability) of
> response or my head will be served on a plate/platter/something else ???
> :-)
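
The intercept correction at the end of the quoted question can be written out as a worked equation. This is only a sketch, assuming records were drawn at random within each RESPONSE group and that the sampling fractions from the example are expressed as proportions (r_1 = 0.10 for the 1s, r_0 = 0.005 for the 0s). The symbols r_1, r_0 and the "pop"/"sample" superscripts are notation introduced here, not taken from the original post.

```latex
\begin{align*}
b_0^{\text{pop}}
  &= b_0^{\text{sample}} - \log\frac{r_1}{r_0}
   = b_0^{\text{sample}} - \log\frac{0.10}{0.005}
   = b_0^{\text{sample}} - \log 20
   \approx b_0^{\text{sample}} - 3.00 \\
\operatorname{logit}\bigl(p^{\text{pop}}\bigr)
  &= \operatorname{logit}\bigl(\hat{p}\bigr) - \log\frac{r_1}{r_0}
\end{align*}
```

Under that assumption only the intercept shifts; the slopes are unaffected. Ranking customers by phat therefore gives the same order either way, and the adjustment matters when the probability itself is needed.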

[1] The probabilities may not be incorrect. Which probabilities are you looking at?

[2] The easiest way to fix *all* of this is to use PROC SURVEYLOGISTIC instead. Just define a SIZE variable which reflects how you are doing the sampling. Or define this as stratified sampling (which fits your description better). Then let the proc compute the posterior probabilities with correct standard errors, based on your sample design. It could be that easy.
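
A minimal sketch of what [2] might look like, assuming a SAS release that ships PROC SURVEYLOGISTIC with an OUTPUT statement (as far as I know, V8.2 does not include this procedure). The data set name mysample and the names sampwt, probs_w and phat_w are hypothetical; the sampling fractions are the ones from the example in the question. Only a WEIGHT statement is shown here; describing the design further (for example with a STRATA statement) is left open.

```sas
/* Sampling weight = 1 / sampling fraction of the group the record came from. */
/* Fractions are from the example: 10% of the 1s and 0.5% of the 0s.          */
data mysample_w;
   set mysample;                           /* hypothetical: the ~10K-record sample */
   if RESPONSE = 1 then sampwt = 1 / 0.10;   /* each sampled 1 stands for 10 records  */
   else                 sampwt = 1 / 0.005;  /* each sampled 0 stands for 200 records */
run;

/* Weighted fit: predicted probabilities refer to the full population, */
/* not to the 50-50 sample.                                            */
proc surveylogistic data=mysample_w;
   weight sampwt;
   model RESPONSE(event='1') = VAR1 VAR2 COUNTER;
   output out=probs_w predicted=phat_w;
run;
```

With these weights the 5,000 sampled 1s represent 50,000 records and the 4,750 sampled 0s represent 950,000, which restores the 5-95 split of the original data.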

[3] If you don't have time to stare at your computer screen, then how do you have time to wait for days for one of us to get back to you on this stuff? :-) :-)

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
----------------------------------------------------------------------

Thanks, David... Pertaining to your answer [1], I am looking at the probabilities phat. (Are there other probabilities of interest in this case?) I assume (correctly, I hope) that phat represents the probability that each RESPONSE observation will take on the value 0 (I hope to see low values close to 0 in this case) or 1 (I hope to see high values close to 1 in this case). These are the probabilities, phat, which I will use to predict the expected response rate when I apply the model to a new set of customers/prospects.
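
If phat comes from a model fitted to the oversampled data with plain PROC LOGISTIC, one way to put it back on the population scale is to rescale the odds in a DATA step. This is a sketch only, assuming random sampling within each RESPONSE group at the fractions from the example (10% of the 1s, 0.5% of the 0s). The names probs_adj, phat_adj, r1 and r0 are hypothetical; probs and phat are from the PROC LOGISTIC step in the question.

```sas
data probs_adj;
   set probs;        /* output data set from the PROC LOGISTIC step             */
   r1 = 0.10;        /* assumed fraction of responders kept in the sample       */
   r0 = 0.005;       /* assumed fraction of non-responders kept in the sample   */
   /* Multiply the odds phat/(1-phat) by r0/r1, then convert back to a probability. */
   phat_adj = (phat * r0) / (phat * r0 + (1 - phat) * r1);
   drop r1 r0;
run;
```

For example, a record with phat = 0.5 in the balanced sample gets phat_adj = 0.0025 / 0.0525, or about 0.048, which is close to the 5% response rate of the full data set.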

Pertaining to [3], I am mostly interested in improving model prediction by doing the 50-50 split (as opposed to the real split reflected by the data, 5% responders to 95% nonresponders). Maybe I am better off with a 25% response-75% nonresponse split... any ideas? It is also very true that when I work with big data sets like the one I mention above, the server crashes. So I must randomly select a smaller sample and work as explained above. Yes, it ***DOES*** take 2 or 3 days of running server time (if it doesn't crash) when I work with a million+ records and 600 inputs (numeric and character). I do have time to wait for your response since I will only have to wait once on ***you*** but always on the computer if I don't change my ways.

David and others, could you please be kind enough to show me how you would use PROC SURVEYLOGISTIC to do what you suggest above? I am off to the web to read about it as soon as I send this note to you. Thanks.

NICK
