Date:   Thu, 26 Jul 2007 14:21:47 -0700
Reply-To:   David L Cassell <davidlcassell@MSN.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   David L Cassell <davidlcassell@MSN.COM>
Subject:   Re: Modeling Question--Transforming a Variable
In-Reply-To:   <>
Content-Type:   text/plain; format=flowed

shiling99@YAHOO.COM sagely replied:
>
>With 10 million and 0.3% of them are events. You have 30000 events. I
>would suggest you randomly select 20000/15000 among them and left
>10000/15000 for validation purpose . And you also randomly 20000/15000
>from no events. In this way, you cut down the size a lot and make your
>developping process easier and does not loss much of estimation
>efficiency. If you still think it is too big, you may scale down
>further. Be careful in calculating the sample weight or you may use
>proc surveyselect. proc surveyselect will spit out the weight for you.
>This approach will have much higher efficiency than a simple random
>sampling in which there are less events.
>
>HTH

Good points all. Thanks for pointing this out.

Let me add a couple thoughts:

Because the events are such a small proportion of the database, we may need to stratify on additional variables as well, so that we get enough observations in the various categories of the auxiliary variables. This is a generally ignored problem, because it requires stopping and thinking about the data. :-)
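
As a rough sketch of that stop-and-think step, one could cross-tabulate the event flag against a candidate auxiliary variable before fixing the design. The names work.db, event, and region are hypothetical stand-ins, not anything from the original poster's data.

```sas
/* Hypothetical names: work.db is the full database, event is the 0/1
   outcome flag, region is one candidate auxiliary variable. */
proc freq data=work.db;
   /* Counts only: small cells in the event=1 row point to categories
      that a sample stratified on event alone could leave too thin. */
   tables event*region / norow nocol nopercent;
run;
```

Categories with few events are the ones that would argue for adding that variable to the STRATA statement when the sample is drawn.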

Since the poster will end up with a survey sample from a finite population (the original database), and that sample may be fairly complex (at a minimum, we will have differing sampling weights and probably stratification), we need to use PROC SURVEYLOGISTIC instead of PROC LOGISTIC to get the right variances.
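
A minimal sketch of how the two steps might fit together, assuming the simplest design discussed here (stratification on the event flag only, 20000 records drawn from each stratum). The data set and variable names (work.db, event, x1, x2) and the seed are placeholders, not the poster's actual names.

```sas
/* Hypothetical names throughout: work.db, event (0/1), x1, x2. */

/* PROC SURVEYSELECT expects the input sorted by the STRATA variables. */
proc sort data=work.db out=work.db_sorted;
   by event;
run;

/* Simple random sample within each stratum. The SAMPSIZE= list follows
   the sorted stratum order: 20000 from event=0, then 20000 from event=1.
   The output data set carries SelectionProb and SamplingWeight. */
proc surveyselect data=work.db_sorted out=work.sample
                  method=srs sampsize=(20000 20000) seed=20070726;
   strata event;
run;

/* Fit the model with the design information so the variance estimates
   reflect the stratification and the unequal weights. No TOTAL= or
   RATE= option is given, so no finite population correction is applied. */
proc surveylogistic data=work.sample;
   strata event;
   model event(event='1') = x1 x2;
   weight SamplingWeight;
run;
```

If additional stratification variables are used in the selection step, the same variables would go on the STRATA statement in both procedures.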

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330


