shiling99@YAHOO.COM sagely replied:
>With 10 million records and 0.3% of them events, you have 30000 events. I
>would suggest you randomly select 20000/15000 among them and leave
>10000/15000 for validation purposes. You also randomly select 20000/15000
>from the non-events. In this way, you cut down the size a lot and make your
>development process easier without losing much estimation
>efficiency. If you still think it is too big, you may scale down
>further. Be careful in calculating the sample weights, or you may use
>PROC SURVEYSELECT, which will spit out the weights for you.
>This approach will have much higher efficiency than a simple random
>sample, in which there are fewer events.
Good points all. Thanks for pointing this out.
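
To make the weight arithmetic concrete, here is a minimal sketch in Python (not SAS, purely to show the numbers). It assumes one reading of the suggestion above: 20000 events and 20000 non-events are drawn by simple random sampling within each group. The function and variable names are hypothetical, and the weight is just the inverse of the selection probability within each stratum.

```python
def sampling_weight(stratum_size, sample_size):
    """Inverse selection probability for simple random sampling in one stratum."""
    if not 0 < sample_size <= stratum_size:
        raise ValueError("sample_size must be between 1 and stratum_size")
    return stratum_size / sample_size

# Figures from the post: 10 million records, 0.3% of them events.
population = 10_000_000
n_events = 30_000
n_nonevents = population - n_events

# Assumed sample sizes (one reading of "20000/15000").
sample_events = 20_000
sample_nonevents = 20_000

w_event = sampling_weight(n_events, sample_events)
w_nonevent = sampling_weight(n_nonevents, sample_nonevents)

# Weighted totals should reproduce the population counts.
weighted_events = sample_events * w_event
weighted_total = weighted_events + sample_nonevents * w_nonevent

unweighted_rate = sample_events / (sample_events + sample_nonevents)
weighted_rate = weighted_events / weighted_total

print("event weight:", w_event)
print("non-event weight:", w_nonevent)
print("unweighted event rate:", unweighted_rate)
print("weighted event rate:", weighted_rate)
```

Under these assumptions each sampled event stands for 1.5 records and each sampled non-event for 498.5, so the unweighted sample is half events while the weighted sample recovers the original 0.3%. That gap is why the weights cannot be ignored in the modeling step.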
Let me add a couple of thoughts:
Because the event proportion is so small, we may need to
stratify on additional variables so that we get enough observations in
the various categories of the auxiliary variables. This is a generally
ignored problem, because it requires stopping and thinking about the data. :-)
Since the poster will end up with a survey sample from a finite population
(the original database), and that sample may be fairly complex (at a
minimum, we will have differing sampling weights and probably
stratification), we need to use PROC SURVEYLOGISTIC instead of
PROC LOGISTIC to get the right variances.
David L. Cassell