Date: Wed, 23 Feb 2005 15:44:21 -0800
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: Multiple Imputation--Missing Data
In-Reply-To: <20050223190442.C52E2101D0@ws1-3.us4.outblaze.com>
"Nick ." <ni14@mail.com> also wrote to me (instead of to SAS-L):
> My data set has inputs each having about 10 to 30% missing values.
> When I use PROC MI (I have SAS Version 8.2) with 5 inputs (these are
> the inputs I wish to impute missing values of), I get
>
> WARNING: The initial covariance matrix for MCMC is singular.
> You can use a PRIOR= option to stabilize the inference.
>
> I have no idea what that means or how to get this experimental (a
> word for not reliable???) version of PROC MI to give me imputed
> values. I also believe that the missing values are MAR (missing at
> Random). These are fields we buy from a vendor--fields like MSA and
> demographic data, etc. Examples are dollar amounts, salaries, home
> market values, etc. It is not like your example below where I
> wouldn't want to impute. (But then again, maybe, I still shouldn't
> impute???) I cannot guarantee, however, that the data we buy follow
> the normality assumptions. This is real life data, not ivory tower
> data. So, I guess, my questions now are:
>
> how do I fix the warning message
>
> how do I get SAS to impute
>
> should I even use PROC MI (experimental, data probably not normal,
> etc.) or should I use some other SAS procedure?
I see you have solved some of your problems in the other message
you sent to me (even though you need to write to SAS-L and not to
me personally). You are limited in what PROC MI can do in SAS 8.2,
and you may have to do some of the multiple imputation by hand, or
upgrade to SAS 9.1.3 to get more help.
And, since your data are purchased, the Missing At Random assumption
is (most likely) unknowable. Ugh. Go ahead and assume it. Then
caveat it mercilessly in your documentation to CYA. (That stands for
Cover Your Assumptions. Or something close.) Explain why you HAD to
assume it and why you CANNOT verify the assumption, and go forward.
But document it.
> Finally, say I do get SAS to impute 5 times. So, if my data set has
> 3 obs, then I will get an imputed data set 3 x 5 = 15 obs with
> IMPUTED NUMBER = 1 (3 obs), IMPUTED NUMBER = 2 (3 obs), etc. What do
> I do with this data set? Do I build one model with IMPUTED NUMBER =
> 1, then another with IMPUTED NUMBER =2, etc. and select the best
> model out of the 5 imputations? (Best in my work means best lift. My
> line of work is banking/finance/campaigns ...)
First, the more missing values you have, the more you should consider
upping that default '5 times'. I find that m=5 works fine for me when
all my variables have less than 10% missings. You don't have that.
Think about increasing m to something over 5. The more you increase
m, the larger your output data set and the longer it takes to run your
analyses. But you need m big enough for your estimates to be stable.
You can do that by running this with m=5, m=10, ... until the estimates
stabilize. You probably don't need to go higher than m=20 or 30.
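
The "increase m until the estimates stabilize" loop can be sketched outside SAS like this. It is only a rough illustration in Python: `pooled_estimate` is a hypothetical callable that you would supply (it stands for "run the imputation with m replicates, run the analysis, return the combined estimate"), and the 1% tolerance is an arbitrary assumption, not a recommendation.

```python
def find_stable_m(pooled_estimate, m_values=(5, 10, 15, 20, 30), rel_tol=0.01):
    """Return (m, estimate) for the first m whose pooled estimate differs
    from the estimate at the previous m by at most rel_tol (relative change).

    pooled_estimate is a hypothetical user-supplied function: m -> float.
    Returns (None, last_estimate) if nothing in m_values is stable.
    """
    previous = None
    for m in m_values:
        current = pooled_estimate(m)
        if previous is not None:
            # Guard against dividing by zero when the previous estimate is 0.
            scale = max(abs(previous), 1e-12)
            if abs(current - previous) / scale <= rel_tol:
                return m, current
        previous = current
    return None, previous


if __name__ == "__main__":
    # Toy stand-in for a real analysis: the estimate settles down as m grows.
    toy = lambda m: 1.0 + 1.0 / m
    print(find_stable_m(toy, rel_tol=0.02))
```

In practice you would look at standard errors as well as point estimates before deciding that m is large enough.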
The way that PROC MI works is like a 'wrapper function'. You run
PROC MI on your data. You get m replicates, all in a now-larger
data set. Each replicate has a different value of the variable
_IMPUTATION_. Run your planned analysis using the BY statement,
with this by-variable. Then feed the results into PROC MIANALYZE,
which combines the m sets of estimates and shows the impact of the imputation.
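
For a feel of what the combining step does, here is a minimal Python sketch of the standard combining rules for multiple imputation (Rubin's rules) applied to one parameter. The function name and the input lists are made up for illustration; they are not SAS output, and this does not reproduce everything PROC MIANALYZE reports (degrees of freedom, confidence limits, and so on).

```python
import math


def pool_rubin(estimates, variances):
    """Combine per-imputation results for a single parameter.

    estimates: the m point estimates, one per imputed data set
    variances: the m squared standard errors, in the same order
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 imputations and matching lengths")

    q_bar = sum(estimates) / m                                # pooled estimate
    within = sum(variances) / m                               # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between                # total variance

    return {
        "estimate": q_bar,
        "within": within,
        "between": between,
        "total_variance": total,
        "std_error": math.sqrt(total),
    }


if __name__ == "__main__":
    # Made-up numbers for m=3 imputations of one regression coefficient.
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what carries the extra uncertainty due to the missing values. Picking the single "best" imputed data set would throw that term away.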
And look to see if you can register at www.sierrainformation.com for
the course "Treatment of Missing Data via Maximum Likelihood and
Multiple Imputation" by Paul Allison, anytime soon. Or consider
getting your boss to hire a statistical consultant to lead you through
these dark and twisty passages, all alike. This is complex enough
that you cannot really expect some yahoo in Oregon to be able to
walk you through all your data problems when he can't sit down and
work with your data.
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician