Date: Wed, 4 Jan 2006 15:46:37 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: MCD Outlier Determination
Content-Type: text/plain; format=flowed
>Happy New Year, and so forth. I'm returning to a topic I asked about
>recently. I want to try using Minimum Covariance Determinant estimation
>to locate outliers. SAS/IML has a function called MCD to implement this
>methodology. Essentially, as I understand it, the algorithm finds
>the "best" half of the data by minimizing the determinant of the
>covariance matrix of a large number of subsamples, and then computing
>robust Mahalanobis-type distances based on this "best" half. The
>distances are then compared with a cutoff, and any distances above the
>cutoff are considered outliers. The MCD function returns the set of
>distance values, as well as a vector of zeroes and ones, where the zeroes
>denote outliers, i.e., values of the robust distances above the cutoff
>point. From what I can tell, the cutoff for the MCD function is ALWAYS
>fixed as the square root of the .975 quantile of the chi-square
>distribution with n degrees of freedom, where n is the number of
>covariates. My question is: is there a way to vary the cutoff in the MCD
>call, or do you have to do it by hand with the set of distances it returns?
You're sort of close on the method. You take h of the N observations as
your "best" subset. You can't do exactly half, but you can do 1 + N/2.
There's an upper limit as well. The default is something like (N+n+1)/2,
where N is the number of obs and n is, as you discussed, the number of
regressors (including an intercept if you have one). Then you do a lot of
sampling (hey, did someone say 'sampling'?) from the original data to get
that robust estimate, which is the one that minimizes an objective
function F_sub_MCD.
As I understand the SAS set-up, you cannot vary the cutoff in the MCD
call. That's later on. You'll have to take the set of distances it returns
and do the cutoff you prefer.
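
Here is a minimal sketch of doing the cutoff by hand. It assumes a
hypothetical data set MYDATA with numeric columns x1-x3, that the robust
distances come back in the second row of the DIST matrix (that is how I
read the MCD documentation, so check it against your release), and a
.99 quantile that I picked only for illustration.

```sas
proc iml;
   /* MYDATA and x1-x3 are made-up names for illustration */
   use mydata;
   read all var {x1 x2 x3} into x;
   close mydata;

   opt = j(8, 1, .);                 /* missing values = MCD defaults */
   call mcd(sc, coef, dist, opt, x);

   rd     = dist[2, ];               /* robust distances (assumed to be row 2) */
   n      = ncol(x);                 /* number of covariates */
   alpha  = 0.99;                    /* your choice; the built-in is .975 */
   cutoff = sqrt(cinv(alpha, n));    /* sqrt of chi-square quantile, n df */

   flag = (rd > cutoff);             /* 1 = outlier under YOUR cutoff */
   idx  = loc(flag);                 /* obs numbers of flagged points */
   if ncol(idx) > 0 then print idx;
quit;
```

Note that FLAG here uses 1 for an outlier, which is the reverse of the
zero/one vector that MCD itself returns.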
There is one alternative you could try. Rather than doing this via SAS/IML,
you could try using PROC ROBUSTREG and letting it do the fit. You'd still
be assigning your own cutoff and stuff. But it might be easier. It would
make it easier to use ODS Statistical Graphics to plot the results.
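
A rough sketch of that route, again with a hypothetical data set MYDATA,
now with a response y and covariates x1-x3. The LEVERAGE option and the
ROBDIST= output keyword are as I recall them from the ROBUSTREG
documentation, so verify them for your release. The .99 quantile is just
an example.

```sas
/* MYDATA, y, and x1-x3 are made-up names for illustration */
proc robustreg data=mydata;
   model y = x1 x2 x3 / diagnostics leverage;
   output out=robout robdist=rd;     /* rd = robust distance per obs */
run;

data flagged;
   set robout;
   cutoff     = sqrt(cinv(0.99, 3)); /* 3 = number of covariates here */
   my_outlier = (rd > cutoff);       /* 1 = above your own cutoff */
run;
```

Unlike the MCD call, this needs a response variable, since ROBUSTREG is
fitting a regression at the same time.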
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330