LISTSERV at the University of Georgia
Date:         Tue, 15 Jan 2008 13:57:05 -0500
Reply-To:     Wensui Liu <liuwensui@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Wensui Liu <liuwensui@GMAIL.COM>
Subject:      Re: Is GENMOD Stuck???
Comments: To: "data _null_," <datanull@gmail.com>
In-Reply-To:  <7367b4e20801151037p78ea6cdfi4fc28e75db7bfbbd@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

I should run that experiment; it is indeed good homework to do. However, one thing I'd like to add is that in direct marketing or risk modeling, the event is often very rare. Say that in a 100K dataset the event rate (prob of 1) is 0.3%, i.e. about 300 events. If we sample 1% of the 100K, we have 1K records. Out of this 1K, we have only 3 events after stratified sampling. I would never take the risk of modeling 3 events out of 1,000 records.
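To make the arithmetic above concrete, here is a minimal Python sketch (not SAS, and not anything posted in the thread). The function names, the seed, and the use of simple random sampling in the second part are assumptions for illustration. It only counts events per sample; it does not fit any model.

```python
import random


def expected_events(population_size, event_rate, sample_fraction):
    """Sample size and event count for a proportionate stratified sample."""
    sample_size = round(population_size * sample_fraction)
    events_in_population = round(population_size * event_rate)
    events_in_sample = round(events_in_population * sample_fraction)
    return sample_size, events_in_sample


def simulate_event_counts(population_size, event_rate, sample_fraction,
                          n_samples, seed=2008):
    """Event counts in repeated simple random samples (no stratification)."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    n_events = round(population_size * event_rate)
    # 1 marks an event, 0 a non-event
    population = [1] * n_events + [0] * (population_size - n_events)
    sample_size = round(population_size * sample_fraction)
    return [sum(rng.sample(population, sample_size)) for _ in range(n_samples)]


sample_size, events = expected_events(100_000, 0.003, 0.01)
print(sample_size, events)  # 1000 3

# Without stratification the count is not fixed at 3; it varies by sample.
counts = simulate_event_counts(100_000, 0.003, 0.01, n_samples=100)
print(min(counts), max(counts), sum(counts) / len(counts))
```

The first part reproduces the 3-in-1,000 figure. The second shows how much the event count moves around from one 1% sample to the next when the sampling is not stratified.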

On Jan 15, 2008 1:37 PM, data _null_, <datanull@gmail.com> wrote:
> Since Wensui has "tons" of observations perhaps he could do an experiment.
>
> Say for some large file that has been modeled and the "answer" is
> known. Take 100 or so 1% random samples and see how close the
> estimates are.
>
>
> On Jan 15, 2008 12:28 PM, Peter Flom <peterflomconsulting@mindspring.com> wrote:
> > Wensui Liu <liuwensui@GMAIL.COM> wrote
> >
> > >1-2% sample is a very interesting point I've ever seen.
> > >The guideline I usually follow is picked up from Hastie 'elements of
> > >statistical learning', which says 50% for training, 25% for
> > >validation, and 25% for testing. He could be wrong though. ^_^.
> > >It seems different games have different rules.
> > >
> >
> > Hastie isn't wrong, and E of SL is a great book.
> >
> > But he's answering a different question. His guidelines are about how to split up reasonably sized data sets.
> >
> > What is 'reasonable' - well, depends on your field, the complexity of your model, and computing power. But it's hard to see a case where millions of observations would do anything except slow down the computer. Sure, it makes for more precise estimates, but how precise do you ever need an estimate to be? Even if your model is very very accurate, the model error is going to totally swamp the sampling error with millions of records.
> >
> > Peter
> >
>

--
===============================
WenSui Liu
Statistical Project Manager
ChoicePoint Precision Marketing
(http://spaces.msn.com/statcompute/blog)
===============================

