LISTSERV at the University of Georgia
Date:         Wed, 7 Dec 2005 17:42:12 -0500
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: Subject: Logist Model Build--How big a dataset to use
Comments: To: "Nick ." <ni14@mail.com>
Content-Type: text/plain; charset="us-ascii"

Nick: What about the idea of taking a random sample of variables? That would take care of those pesky problems with step-wise selection!

On a more serious note, I do see data mining experts advising colleagues to sample rows (observations) of very large datasets and use the sample to develop statistical models. In fact, it's the first step in SAS's recommended SEMMA strategy (sample, explore, modify, model, assess), though described there as the result of representative sampling. (I don't know how that works when one already has a dataset to analyze.)

Now it does make sense, when the number of observations allows, to divide a data source randomly into training, test, and validation samples, and to set the test and validation samples aside (no peeking). Even so, that may still leave a lot of observations to process.
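A minimal sketch of such a random three-way split, in Python rather than SAS. The function name `three_way_split`, the 60/20/20 fractions, and the seed are illustrative assumptions, not anything prescribed in this thread.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=2005):
    """Shuffle rows once, then cut them into training, test, and validation parts.

    The fractions and seed are hypothetical defaults chosen for illustration.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_test = round(n * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation

# Row identifiers stand in for observations here.
train, test, validation = three_way_split(range(1000))
print(len(train), len(test), len(validation))
```

Because every row lands in exactly one part, the test and validation parts can be written out and left untouched until the model is final.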

I find it surprising that no one has recommended summarization of data as an alternative to sampling from an existing dataset. Many SAS statistical PROCs handle summarized data very efficiently. To the extent that summarizing does not affect essential features of the data, estimates do not vary significantly, if at all.

Sig
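A rough illustration of the summarization point, in Python rather than SAS and under stated assumptions: the data are made up, there is a single predictor `x` with three levels, and the fit is a hand-rolled Newton-Raphson loop rather than any SAS procedure. Collapsing identical rows into (x, y, count) cells and weighting each cell by its count leaves the logistic likelihood unchanged, so the estimates should agree up to rounding. In PROC LOGISTIC the analogous idea would be a FREQ statement or the events/trials syntax.

```python
import math
from collections import Counter

def fit_logistic(cells, iterations=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    cells is a list of (x, y, count); a raw row is simply a cell with count 1.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, count in cells:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = count * p * (1.0 - p)
            g0 += count * (y - p)        # score for the intercept
            g1 += count * (y - p) * x    # score for the slope
            h00 += w                     # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up raw data: 150 rows of (x, y).
raw = ([(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 30 +
       [(1, 1)] * 20 + [(2, 0)] * 15 + [(2, 1)] * 35)

raw_cells = [(x, y, 1) for x, y in raw]
summary_cells = [(x, y, n) for (x, y), n in Counter(raw).items()]

raw_fit = fit_logistic(raw_cells)
summary_fit = fit_logistic(summary_cells)
print(len(raw_cells), "rows:", raw_fit)
print(len(summary_cells), "cells:", summary_fit)
```

With categorical or coarsely binned predictors the number of cells can be far smaller than the number of rows, which is where the savings would come from; continuous predictors would need binning first, and that can change the estimates.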

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Nick .
Sent: Wednesday, December 07, 2005 3:27 PM
To: sas-l@listserv.uga.edu
Subject: Subject: Logist Model Build--How big a dataset to use

* Subject: Logist Model Build--How big a dataset to use

Hi, I posted the following question (in part) a few days back

>>> Hello, I have a data set of 1.2 million records and 11K responders (0.92% response rate). I would like to build a predictive logistic model using as small a dataset as possible so as to avoid unnecessary variables creeping into the model and thereby increasing the misclassification (prediction) error. For example:

METHOD1 Would it be a good idea to randomly sample the 1.2 million records, select 5% of them, and build the model? Or should I randomly select 10%? What is the percentage as a rule of thumb?

... ... ... Thanks. NICK >>>
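For scale, a small Python sketch of what the sampling fractions in the question would leave to model with. It assumes simple random sampling and treats the "11K" responder count as exactly 11,000, which is an approximation.

```python
total_records = 1_200_000
responders = 11_000  # "11K" in the question, taken as exact for illustration

response_rate = responders / total_records
print(f"response rate: {response_rate:.2%}")

# Expected sample size and responder count under simple random sampling.
expected = {}
for fraction in (0.05, 0.10):
    sample_size = round(total_records * fraction)
    expected_responders = round(responders * fraction)
    expected[fraction] = (sample_size, expected_responders)
    print(f"{fraction:.0%} sample: {sample_size} records, "
          f"about {expected_responders} responders")
```

A 5% sample would keep only about 550 of the responders, and a 10% sample about 1,100.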

and D.C. responded (in part)

>>> The best way to avoid unnecessary variables creeping into the model (assuming you are doing the right things in the model building process) is to include as many observations as you can. The possible bad effects of any bad data can be swamped if you have enough good data. :-)

. . . . . . . . .

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
>>>

D.C. wanted me to cite the book that inspired my question. I promised I would list the book reference:

STATISTICAL MODELING and ANALYSIS for DATABASE MARKETING by Bruce Ratner

On p. 36 (in Chapter 3, the logistic chapter) he says:

"There is statistical factoid that states if the true model can be built with small data, then the model built with extra big data produces large prediction error variance. Data analysts are never aware of the true model, but are guided when building it by the principle of simplicity. Therefore, it is wisest to build the model with small data. If the predictions are good, then the model is a good approximation of the true model; if predictions are not acceptable, then the EDA procedure prescribes an increase data size (by adding predictor variables and individuals) until the model produces good predictions. The data size, with which the model produces good predictions, is big enough. If extra big data are used, unnecessary variables tend to creep into the model, thereby increasing the prediction-error variance."


