shiling99@YAHOO.COM sagely replied:
>
>With 10 million records and 0.3% of them events, you have 30,000 events. I
>would suggest you randomly select 20,000 (or 15,000) of them and leave the
>remaining 10,000 (or 15,000) for validation. Also randomly select 20,000
>(or 15,000) from the non-events. This cuts the size down a lot and makes
>your development process easier without losing much estimation
>efficiency. If you still think it is too big, you may scale down
>further. Be careful in calculating the sampling weights, or use
>PROC SURVEYSELECT, which will output the weights for you.
>This approach is much more efficient than a simple random
>sample, which would contain far fewer events.
>
>HTH

Good points all. Thanks for pointing this out.

Let me add a couple of thoughts:

Because events are such a small proportion of the data, we may need to
stratify on additional variables so that we get enough observations in the
various categories of the auxiliary variables. This is a generally ignored
problem, because it requires stopping and thinking about the data. :-)

Since the poster will end up with a survey sample from a finite population
(the original database), and that sample may be fairly complex (at a
minimum, we will have differing sampling weights and probably
stratification), we need to use PROC SURVEYLOGISTIC instead of
PROC LOGISTIC to get the right variances.
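
A minimal sketch of that workflow, for illustration only. The data set
name work.full, the 0/1 flag event, and the predictors x1 and x2 are all
hypothetical (none of them come from the original question), and the
sample sizes are simply the 20,000/20,000 split suggested above. Adjust
the strata and the model to the real data.

```sas
/* Hypothetical names: work.full (the 10-million-row table),
   event (0/1 flag), x1 and x2 (predictors). */

/* PROC SURVEYSELECT expects the input sorted by the STRATA variable. */
proc sort data=work.full out=work.sorted;
   by event;
run;

/* Stratified simple random sample: 20,000 non-events and 20,000 events.
   SAMPSIZE= values are given in sorted stratum order (event=0, event=1).
   With a STRATA statement, the OUT= data set includes the variables
   SelectionProb and SamplingWeight. */
proc surveyselect data=work.sorted out=work.dev
                  method=srs sampsize=(20000 20000) seed=12345;
   strata event;
run;

/* With the counts quoted above, SamplingWeight would be
   9,970,000 / 20,000 = 498.5 for non-events and
   30,000 / 20,000 = 1.5 for events. */

/* Design-based logistic regression using the strata and the weights. */
proc surveylogistic data=work.dev;
   strata event;
   weight SamplingWeight;
   model event(event='1') = x1 x2;
run;
```

If the sample is also stratified on auxiliary variables, they would be
added to the BY statement of the sort and to both STRATA statements,
with SAMPSIZE= extended to one value per stratum.
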
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330