LISTSERV at the University of Georgia
Date:         Thu, 6 May 2010 17:52:00 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: 00 SEMMA Methodology Limitations and risks
Comments: To: SUBSCRIBE SAS-L Olivier Van Parys <olivier.vanparys@GMAIL.COM>
In-Reply-To:  <201005061903.o46GJHVP027585@malibu.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"

Olivier: While drawing a random sample from a sample may reduce the computational and memory demands of statistical modeling and learning programs, it may also largely eliminate the extreme observations in a normal distribution or, from another angle, shrink subsamples that include observations of interest to the point that estimates no longer reach statistical significance. Oversampling small subsamples of special interest may work around this inevitable "data truncation" problem, but it then requires weighting the observations to adjust for the oversampling.
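To make the oversampling-and-weighting point concrete, here is a minimal sketch in Python (not SAS, and not anything taken from Enterprise Miner). The population, the 1% rare class, the sampling rates, and the function names are all hypothetical, chosen only for illustration. Each sampled row carries a weight equal to the inverse of its sampling rate, and the weighted mean lands close to the population mean while the unweighted mean does not.

```python
import random

def oversample_with_weights(rows, is_rare, rare_rate, common_rate, seed=0):
    """Keep rare rows with probability rare_rate, other rows with common_rate.

    Returns (row, weight) pairs where weight = 1 / sampling rate, so the
    weighted sample stands in for the population it was drawn from.
    """
    rng = random.Random(seed)
    sample = []
    for row in rows:
        rate = rare_rate if is_rare(row) else common_rate
        if rng.random() < rate:
            sample.append((row, 1.0 / rate))
    return sample

def weighted_mean(pairs, value):
    """Weighted mean of value(row) over (row, weight) pairs."""
    total_weight = sum(w for _, w in pairs)
    return sum(value(r) * w for r, w in pairs) / total_weight

# Hypothetical population: 1% of rows are "rare" and have a much larger y.
population = [
    {"rare": i % 100 == 0, "y": 10.0 if i % 100 == 0 else 1.0}
    for i in range(100_000)
]

# Keep every rare row, but only about 5% of the common rows.
sample = oversample_with_weights(
    population, lambda r: r["rare"], rare_rate=1.0, common_rate=0.05
)

true_mean = sum(r["y"] for r in population) / len(population)
unweighted = sum(r["y"] for r, _ in sample) / len(sample)
weighted = weighted_mean(sample, lambda r: r["y"])

print(f"population mean {true_mean:.3f}")
print(f"unweighted mean {unweighted:.3f}")
print(f"weighted mean   {weighted:.3f}")
```

The unweighted figure is pulled toward the oversampled group; the weights undo that distortion at the cost of carrying an extra column through the analysis.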

With respect to model selection, randomly selecting a smaller sample from a larger sample may eliminate influential observations. The results may be better or worse. In any event, random sampling would tend to increase model uncertainty even when sampling from a population, though not as severely as when sampling from a sample. For many data mining and imputation problems, analysts select several independent random samples and "model average" as part of efforts to make predictive models more robust.
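A rough sketch of the "model average" idea, again in Python and under stated assumptions: the data are simulated from a hypothetical line y = 2 + 3x plus noise, the "model" is a simple least-squares line, and averaging is done on the fitted coefficients. Real applications might average predictions instead, or use a different model family altogether.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b

def averaged_fit(data, n_samples, sample_size, seed=0):
    """Fit a line on several independent random subsamples and average
    the coefficients. Returns ((a, b), list of per-sample fits)."""
    rng = random.Random(seed)
    fits = [fit_line(rng.sample(data, sample_size)) for _ in range(n_samples)]
    a = statistics.fmean([f[0] for f in fits])
    b = statistics.fmean([f[1] for f in fits])
    return (a, b), fits

# Hypothetical data: y = 2 + 3x + Gaussian noise.
gen = random.Random(42)
data = []
for _ in range(10_000):
    x = gen.uniform(0.0, 10.0)
    data.append((x, 2.0 + 3.0 * x + gen.gauss(0.0, 1.0)))

(avg_a, avg_b), fits = averaged_fit(data, n_samples=20, sample_size=200)
slopes = [b for _, b in fits]

print(f"slope range over single samples: {min(slopes):.3f} to {max(slopes):.3f}")
print(f"averaged fit: a = {avg_a:.3f}, b = {avg_b:.3f}")
```

Any single subsample gives a slope that wanders around the underlying value; the average over the subsamples is typically steadier, which is the robustness being sought.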

Many of the SAS statistical procedures one might use to estimate parameters of a model support weighting using counts from class summaries of finer-grained population or sample data. Using class summaries as a data reduction method may accomplish the same purpose as sampling without introducing some of the limitations of sampling. Besides, class summaries often contribute to model exploration and discovery.

S
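The class-summary approach can be sketched the same way. The example below is Python rather than SAS, and the records are hypothetical: 12,000 rows with discrete x and y values are collapsed into 12 class cells with counts, and a least-squares line fitted to the cells, using the counts as frequencies, matches the fit on the full data. This is analogous in spirit to supplying cell counts on a FREQ statement in a SAS procedure, though it makes no claim about how any particular procedure computes its estimates.

```python
from collections import Counter

def fit_line_freq(triples):
    """Least squares for y = a + b*x over (x, y, count) triples,
    treating each triple as `count` identical observations."""
    triples = list(triples)
    n = sum(c for _, _, c in triples)
    mx = sum(x * c for x, _, c in triples) / n
    my = sum(y * c for _, y, c in triples) / n
    sxx = sum(c * (x - mx) ** 2 for x, _, c in triples)
    sxy = sum(c * (x - mx) * (y - my) for x, y, c in triples)
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical fine-grained records with discrete x and y values.
records = [(i % 4, 2 * (i % 4) + (i % 3)) for i in range(12_000)]

# Class summary: one row per distinct (x, y) cell, plus a count.
cells = Counter(records)
summary = [(x, y, c) for (x, y), c in cells.items()]

# Fit on the full data (each record has count 1) and on the summary.
a_full, b_full = fit_line_freq((x, y, 1) for x, y in records)
a_sum, b_sum = fit_line_freq(summary)

print(f"{len(records)} records reduced to {len(summary)} cells")
print(f"full data: a = {a_full:.6f}, b = {b_full:.6f}")
print(f"summary:   a = {a_sum:.6f}, b = {b_sum:.6f}")
```

Unlike a random subsample, the summary discards nothing that this model uses, so no extreme or influential cell is lost; the reduction only works well when the variables of interest take a manageable number of distinct values or can be binned.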

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of SUBSCRIBE SAS-L Olivier Van Parys
Sent: Thursday, May 06, 2010 3:03 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: 00 SEMMA Methodology Limitations and risks

Hi Everyone,

I have read somewhere that, while it is a very good methodology, the Sample, Explore, Modify, Model, and Assess (SEMMA) approach implemented in EMiner also has some limitations. Do you guys remember which ones? The only one I can think of relates to the sampling aspect, which can be a limitation if the sampling is totally random (bias and large standard errors). Stratified sampling is a potential solution, but one needs to isolate the relevant variables, which is not convenient if the model is being built for the first time. Would that be correct? What other limitations can you think of? Would love to hear your views.

Regards,

Olivier Van Parys, PhD Google - Global Sales and Market Intelligence

