Date: Wed, 7 Dec 2005 17:42:12 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Subject: Logist Model Build--How big a dataset to use
Content-Type: text/plain; charset="us-ascii"
What about the idea of taking a random sample of variables? That would
take care of those pesky problems with step-wise selection!
On a more serious note, I do see data mining experts advising
colleagues to sample rows (observations) of very large datasets and use
the sample to develop statistical models. In fact, it's the first step
in SAS's recommended SEMMA strategy (sample, explore, modify, model,
assess), though there it is described as representative sampling. (I don't
know quite how that works when one already has a dataset to analyze.)
Now it does make sense, when the number of observations allows it, to divide
a data source randomly into training, test, and validation samples, and to
set the test and validation samples aside (no peeking). Even so, that
may still leave a lot of observations to process.
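For what it's worth, a split like that takes only a simple DATA step. Here is
a minimal sketch, assuming a source dataset named BIG and an arbitrary
60/20/20 split (the dataset name, seed, and proportions are placeholders, not
anything from the original post):

/* Random 60/20/20 split of a hypothetical dataset BIG into        */
/* training, test, and validation samples.                         */
data train test validate;
   set big;
   u = ranuni(12345);                /* uniform(0,1) random draw   */
   if u < 0.60 then output train;
   else if u < 0.80 then output test;
   else output validate;             /* set aside: no peeking      */
   drop u;
run;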
I find it surprising that no one has recommended summarization of data
as an alternative to sampling from an existing dataset. Many SAS
statistical PROCs handle summarized data very efficiently. To the
extent that summarizing does not affect essential features of the data,
estimates do not vary significantly, if at all.
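To make the summarization idea concrete, here is a rough sketch, assuming a
hypothetical dataset BIG with a 0/1 response RESP and two categorical
predictors X1 and X2 (all names invented for illustration). Collapse the rows
to one record per covariate pattern, then fit the logistic model on the
counts using the events/trials syntax:

proc sql;
   create table summ as
   select x1, x2,
          sum(resp) as events,       /* responders in this cell    */
          count(*)  as trials        /* total rows in this cell    */
   from big
   group by x1, x2;
quit;

proc logistic data=summ;
   class x1 x2 / param=ref;
   model events/trials = x1 x2;      /* fit on the summarized data */
run;

With a handful of categorical predictors, a 1.2 million-row table often
collapses to a few thousand covariate patterns, and the grouped binomial fit
gives the same coefficient estimates as the row-level fit would.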
From: firstname.lastname@example.org [mailto:email@example.com]
On Behalf Of Nick .
Sent: Wednesday, December 07, 2005 3:27 PM
Subject: Subject: Logist Model Build--How big a dataset to use
I posted the following question (in part) a few days back
I have a data set of 1.2 million records and 11K responders (0.92%
response rate). I would like to build a predictive logistic model using
as small a dataset as possible so as to avoid unnecessary variables
creeping into the model and thereby increasing the misclassification
(prediction) error. For example:
Would it be a good idea to randomly sample the 1.2 million records, select 5%
of them, and build the model? Or should I randomly select 10%? What is the
right percentage, as a rule of thumb?
and D.C. responded (in part)
The best way to avoid unnecessary variables creeping into the model
(assuming you are doing the right things in the model building process)
is to include as many observations as you can. The possible bad effects
of any bad data can be swamped if you have enough good data. :-)
. . .
. . .
. . .
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330
D.C. wanted me to cite the book that inspired my question. I
promised I would list the book reference:
STATISTICAL MODELING and ANALYSIS for DATABASE MARKETING by Bruce Ratner
On p. 36 (in the logistic regression chapter, Chapter 3) he says:
"There is statistical factoid that states if the true model can be built
with small data, then the model built with extra big data produces large
prediction error variance. Data analysts are never aware of the true
model, but are guided when building it by the principle of simplicity.
Therefore, it is wisest to build the model with small data. If the
predictions are good, then the model is a good approximation of the true
model; if predictions are not acceptable, then the EDA procedure
prescribes an increase data sixe (by adding predictor variables and
individuals) until the model produces good predictions. The data size,
with which the model produces good predictions, is big enough. If extra
big data are used, unnecessary variables tend to creep into the model,
thereby increasing the prediction-error variance."