LISTSERV at the University of Georgia
Date:         Mon, 18 Dec 2006 22:16:24 -0800
Reply-To:     David L Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David L Cassell <davidlcassell@MSN.COM>
Subject:      Re: Macro Error
Comments: To: Excimer@163.COM
In-Reply-To:  <1166357018.006145.280190@16g2000cwy.googlegroups.com>
Content-Type: text/plain; format=flowed

Excimer@163.COM wrote back:
>"David L Cassell wrote:
>"
> > Oh dear.
> >
> > Is this really your company's process for evaluating outliers?
> >
> > Can you *really* have 50% or more of the data equal to
> > exactly one value? That doesn't sound like anything that would
> > relate to your choice of 4.9303 times the (Q2-Q1).. which doesn't
> > look that good anyway, since the standard rules relate to the
> > interquartile range or the H-spread, not the value of Q2-Q1 or
> > Q3-Q2.
> >
> > So I think that you may not need the complexity you have
> > in this code; and
> > you may want to go back and make sure that the values used
> > for the checks are correct.
> >
> > Even if they are correct, I would not use them alone. Dropping
> > high and low values without regard to *other* variables is a
> > poor decision, since you end up losing important records which
> > may merely show strong relationships with other variables.
> >
> > HTH,
> > David
> > --
> > David L. Cassell
> > mathematical statistician
> > Design Pathways
> > 3115 NW Norwood Pl.
> > Corvallis OR 97330

>
>Hi David,
>
>Thanks for your comments.
>Actually, use Q2-Q1 and Q3-Q2 instead of interquartile range is the
>consideration of skewness.

I don't see that this is warranted. What are your citations for doing this when a 'normality check' is intended?

And where do the cutoff values come from? This is crucial. The classic cutoffs based on hinges are built using rules of thumb from work by Tukey.
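For reference, the usual Tukey rule of thumb puts the fences 1.5 times the H-spread beyond the hinges. Here is a minimal sketch of that rule, in Python rather than SAS purely for compactness. It uses ordinary quartiles as a stand-in for the hinges, and the function names and the sample data are made up for illustration; they are not from the original program.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    Quartiles are used as an approximation to Tukey's hinges.
    """
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up data: one obviously wild value at the top.
data = [2, 4, 4, 5, 5, 5, 6, 6, 7, 8, 30]
print(tukey_fences(data))   # (1.5, 9.5)
print(flag_outliers(data))  # [30]
```

Note that the rule only *flags* points for a closer look. It says nothing about deleting them.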

>The outliers are excluded before normality check.

Okay, that is a bad idea. If the data are normal, you do not need this 'outlier hacking'. If the data are not normal, then hacking off tails may completely distort the data and their usefulness. If the data are normal except for data contamination, I do not see how hard-wired cutoff points are the solution.
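A small simulation shows the kind of distortion I mean. This is only a sketch under assumed conditions: the exponential distribution, the sample size, the seed, and the 1.5*IQR cutoffs are all arbitrary choices for illustration, not anything taken from your data or your code.

```python
import random
import statistics

random.seed(2006)  # arbitrary seed, only so the run is repeatable

# Skewed, perfectly legitimate data: nothing here is "contaminated".
sample = [random.expovariate(1.0) for _ in range(2000)]

# Hard-wired cutoffs applied before any other analysis.
q1, _q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
lower = q1 - 1.5 * (q3 - q1)
upper = q3 + 1.5 * (q3 - q1)
trimmed = [x for x in sample if lower <= x <= upper]

full_mean = statistics.fmean(sample)
trimmed_mean = statistics.fmean(trimmed)

print("points dropped:", len(sample) - len(trimmed))
print("mean before trimming:", full_mean)
print("mean after trimming: ", trimmed_mean)
```

Every dropped point comes from the long right tail, so the trimmed mean is biased downward even though no record was actually bad.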

Plus, the whole idea of a 'normality check' puts the skewness issues in a different light. What is the point? If the data are skewed, there's no point in doing a normality check. If the data are not skewed, there's no reason to do this Q3-Q2 vs. Q2-Q1 process.

*AND* if you really have to worry about this kind of skewness, comparing Q3 to Q1 is just wrong. Q3-Q2 or Q2-Q1 may be 0 when the other difference is not. Which invalidates the next steps of your code.
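To make the zero-spread problem concrete, here is a sketch in Python. I have not seen the exact formula in your macro, so the way the 4.9303 multiplier is applied below (Q1 minus k times Q2-Q1, Q3 plus k times Q3-Q2) is an assumption, and the data are invented so that more than half the values are tied.

```python
import statistics

K = 4.9303  # multiplier mentioned in the thread; its use below is assumed


def half_spread_cutoffs(values, k=K):
    """Hypothetical cutoffs built from the two half-spreads separately."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower_spread = q2 - q1
    upper_spread = q3 - q2
    low = q1 - k * lower_spread
    high = q3 + k * upper_spread
    return low, high, lower_spread, upper_spread


# Six of the ten values are identical, so Q1 and Q2 coincide.
data = [10, 10, 10, 10, 10, 10, 11, 12, 14, 18]
low, high, lower_spread, upper_spread = half_spread_cutoffs(data)

print("Q2-Q1:", lower_spread)  # 0.0
print("Q3-Q2:", upper_spread)  # 1.75
print("cutoffs:", low, high)
```

With Q2-Q1 equal to 0 the lower cutoff collapses onto Q1 itself, so a value of 9.99 would be thrown out as an "outlier", and any later step that divides by Q2-Q1 fails outright.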

> The purpose of this
>program is just for classification.
>If they are normal distributed, the other windows programs will control
>the data based on normal procedures,
>Otherwise, the other windows programs will take special care of them.

Okay, now you have made a bad statistical error. You have done an _a_priori_ screen on the data, thrown out data points, and then done a *conditional* statistical analysis without adjusting your hypothesis evaluation for this conditionality. So your p-values and CIs and such are now messed up.

Let me re-iterate. This looks like your company is doing The Wrong Thing. You need to talk to your boss and get him to hire a statistical consultant to fix this stuff and get better analytical procedures in place.

>However, the current program is really slow since the oracle database
>is too large.

The problem may be elsewhere: are you absolutely positive that the bottleneck is not in the data reads or the data transport?

But yes, the process is long and clunky, and needs to be fixed. I think it needs to be fixed from the ground up, beginning with a re-examination of the fundamental rules for your process.

>I really thanks if anyone can help suggest some improve directions.
>

I just wrote some, but you're probably not going to like them.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
