Date: Thu, 6 May 2010 17:52:00 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: 00 SEMMA Methodology Limitations and risks
Content-Type: text/plain; charset="us-ascii"
While random sampling from a sample may reduce the computational and memory demands of statistical modeling and learning programs, it may also largely eliminate extreme observations (for example, those in the tails of a normal distribution) or, from another angle, shrink subsamples that include observations of interest to the point that estimates will not reach statistical significance. Oversampling small subsamples of special interest may work around this inevitable "data truncation" problem, but it then requires weighting of observations to adjust for the oversampling.
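To illustrate the oversampling-plus-weighting idea, here is a minimal Python sketch (not SAS). The function names, the sampling rates, and the made-up population in the usage note are all assumptions for illustration. Each kept record carries a weight equal to the inverse of its sampling rate, so a weighted estimate adjusts for the oversampling.

```python
import random

def oversample_with_weights(records, is_rare, common_rate, rare_rate, seed=0):
    """Keep rare records at a higher rate than common ones.

    Each kept record is paired with a weight of 1 / (its sampling rate),
    so weighted estimates adjust for the unequal sampling.
    """
    rng = random.Random(seed)
    sample = []
    for rec in records:
        rate = rare_rate if is_rare(rec) else common_rate
        if rng.random() < rate:
            sample.append((rec, 1.0 / rate))
    return sample

def weighted_mean(sample, value):
    """Weighted mean of value(rec) over (rec, weight) pairs."""
    total_weight = sum(w for _, w in sample)
    return sum(value(rec) * w for rec, w in sample) / total_weight

def unweighted_mean(sample, value):
    """Naive mean that ignores the weights (biased after oversampling)."""
    return sum(value(rec) for rec, _ in sample) / len(sample)
```

For example, with a hypothetical population in which 1% of records are "rare", keeping every rare record and 5% of the rest, weighted_mean should land close to the population mean while unweighted_mean overstates the rare group's influence.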
With respect to model selection, random selection of a smaller sample from a larger sample may eliminate influential observations. The results may be better or worse. In any event, random sampling tends to increase model uncertainty even when sampling from a population, though not as severely as when sampling from a sample. For many data mining and imputation problems, analysts select several independent random samples and "model average" as part of efforts to make predictive models more robust.
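As a rough sketch of what "model averaging" over several random samples can look like, the Python below fits a simple least-squares line to each of several independently drawn subsamples and averages the predictions. The straight-line model, the function names, and the parameters n_samples and sample_size are assumptions chosen for brevity, not a description of any particular SAS or Enterprise Miner procedure.

```python
import random
from statistics import mean

def fit_line(pairs):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    return my - b * mx, b

def averaged_predictor(data, n_samples, sample_size, seed=0):
    """Fit one line per random subsample and average their predictions."""
    rng = random.Random(seed)
    # Each subsample is drawn without replacement, independently of the others.
    fits = [fit_line(rng.sample(data, sample_size)) for _ in range(n_samples)]

    def predict(x):
        return mean([a + b * x for a, b in fits])

    return predict
```

Averaging across subsamples damps the effect of any single draw that happens to include or exclude an influential observation, which is the robustness argument made above.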
Many of the SAS statistical procedures one might use to estimate the parameters of a model support weighting by counts from class summaries of finer-grained population or sample data. Using class summaries as a data reduction method may accomplish the same purpose as sampling without introducing some of the limitations of sampling. Besides, class summaries often contribute to model exploration and discovery.
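The class-summary approach can be sketched in Python as follows (an illustration only; in SAS the counts would typically be supplied through a FREQ or WEIGHT statement). The names class_summary and freq_weighted_fit_line and the two-variable layout are assumptions. Identical rows are collapsed into cells with counts, and a least-squares line fitted with the counts as frequency weights matches the fit on the full row-level data.

```python
from collections import Counter

def class_summary(rows):
    """Collapse identical (x, y) rows into cells with a count per cell."""
    return Counter(rows)

def freq_weighted_fit_line(cells):
    """Least squares for y = a + b*x using counts as frequency weights.

    cells is an iterable of ((x, y), count) pairs, for example
    class_summary(rows).items(). Returns (a, b).
    """
    cells = list(cells)
    n = sum(count for _, count in cells)
    mx = sum(x * count for (x, _), count in cells) / n
    my = sum(y * count for (_, y), count in cells) / n
    sxx = sum(count * (x - mx) ** 2 for (x, _), count in cells)
    sxy = sum(count * (x - mx) * (y - my) for (x, y), count in cells)
    b = sxy / sxx
    return my - b * mx, b
```

Because nothing is discarded, the summarized data reproduce the full-data estimates exactly (up to rounding), while the number of records the fitting step has to read drops to the number of distinct cells.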
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of SUBSCRIBE SAS-L Olivier Van Parys
Sent: Thursday, May 06, 2010 3:03 PM
Subject: 00 SEMMA Methodology Limitations and risks
I have read somewhere that, while being a very good methodology, the Sample,
Explore, Modify, Model, and Assess (SEMMA) approach implemented in EMiner also
had some limitations. Do you guys remember which ones?
The only one I can think of relates to the sampling aspect, which can be a
limitation if totally random (bias and large standard errors). Stratified
sampling is a potential solution, but one needs to isolate the relevant
variables, which is not convenient if the model is being built for the first
time. Would that be correct? What other limitations can you think of?
Would love to hear your views.
Olivier Van Parys, PhD
Google - Global Sales and Market Intelligence