Date: Thu, 6 Nov 2008 16:46:12 -0500
Reply-To: T J <tj_noreply@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: T J <tj_noreply@YAHOO.COM>
Subject: Re: Complex do-looping and/or conditional processing
As mentioned before, you don't have to reinvent the wheel. If you have
proc REG, then you should have proc ROBUSTREG; both are in SAS/STAT.
Here is some code:
proc robustreg data=one method=LTS;
model ind_var = dep_var/leverage cutoff=3 ;
output out=xout outlier=out1 leverage=lev1 sr=sr1 ;
run;
quit;
which produces output dataset XOUT:
Obs   ind_var   dep_var        sr1   out1   lev1
  1       2.0       7.0    -0.5175      0      0
  2       2.5       9.2    -0.8274      0      0
  3       2.7       8.8     0.0268      0      0
  4       3.1      10.1     0.1305      0      0
  5       3.3      11.6    -0.4673      0      0
  6       3.2       9.9     0.5576      0      0
  7       1.0       4.0    -0.9676      0      0
  8       1.1       6.2    -2.3746      0      0
  9       2.0       5.5     0.6288      0      0
 10       3.0       8.0     1.4610      0      0
 11       3.3       9.9     0.8318      0      0
 12       0.8      18.0   -12.2150      1      1
 13       1.2       4.0    -0.4190      0      0
 14       1.0       2.0     0.5608      0      0
 15       1.8       5.0     0.4624      0      0
 16       3.3       4.8     4.7293      1      0
As you can see, obs #12 and #16 are flagged as outliers for you (out1=1), and
obs #12 is also flagged as a leverage point (lev1=1). If you read a little on
Rousseeuw and LTS (least trimmed squares) for outlier detection, you will see
that some of what you want has already been worked out and packaged in the proc.
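If you then want an ordinary least-squares fit without the flagged points,
something along these lines should work. This is only a sketch: the dataset
name CLEAN is made up for illustration, and I have put dep_var on the
left-hand side the way your PROC REG step does.

```sas
/* Keep only the observations that ROBUSTREG did not flag as outliers. */
/* XOUT and OUT1 come from the OUTPUT statement above; CLEAN is a      */
/* hypothetical name for the filtered dataset.                         */
data clean;
  set xout;
  where out1 = 0;
run;

/* Refit the ordinary regression on the remaining points. */
proc reg data=clean;
  model dep_var = ind_var;
run;
quit;
```

With the data above this drops obs #12 and #16 and leaves 14 points for the
refit. Whether a flagged point really should be excluded is still your call.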
-TJ
On Thu, 6 Nov 2008 16:11:13 -0500, Ryan Utz <rutz@AL.UMCES.EDU> wrote:
>Wow, thanks for the response. I hope I can make it clear what I'm looking
>for...
>
>Consider the data set below (a simplified version of what I'm working with).
>There are two variables, one dependent and one independent:
>
>data one; input ind_var dep_var; cards;
>2 7
>2.5 9.2
>2.7 8.8
>3.1 10.1
>3.3 11.6
>3.2 9.9
>1 4
>1.1 6.2
>2 5.5
>3 8
>3.3 9.9
>0.8 18
>1.2 4
>1 2
>1.8 5
>3.3 4.8
>;
>
>When I plot this out using gplot:
>
>proc gplot data=one; plot dep_var*ind_var; run;
>
>one can easily see that there is a quite obvious outlier (probably
>representing an erroneous measurement or incorrectly entered data point) for
>the point (18, 0.8) while a trend is evident in the rest of the data. There
>may be another outlier point (4.8, 3.3), but who knows if I should consider
>excluding it or not? At least the first one is obvious. Using PROC REG I can
>get several useful regression diagnostics. I'm still working out which ones
>to use and the criteria for point elimination, but the code is as follows:
>
>proc reg data=one noprint;
>model dep_var=ind_var;
>output out=two (keep=dep_var ind_var r cd) rstudent=r cookd=cd;
>run;
>
>'r' and 'cd' are regression diagnostics. For both, the greater the absolute
>value, the more suspect the data point. Say, for instance, that I want to
>exclude any point where |r| exceeds 2 (a real criterion used by some). I can
>easily do this manually. But once I do so, the regression dynamics shift and
>other points that were above 2 in the original regression may no longer be so
>after removing this outlier point. So you see, I'm trying to come up with an
>iterative process that runs until |r| for every point is <2, but I need to
>remove points one at a time. I have a number of variable sets to look at and
>most of them have well over 200 points; that's why I'm trying to automate the
>process.
>
>Any thoughts?