Date: Tue, 22 Feb 2005 17:41:28 -0800
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: Multiple Imputation--Missing Data
In-Reply-To: <20050222232726.3CB724BDAA@ws1-1.us4.outblaze.com>
Content-type: text/plain; charset=US-ASCII
"Nick ." <ni14@MAIL.COM> wrote:
> I would like your thoughts on Multiple Imputation.
>
> Q1: Does your data have to satisfy any assumptions in order to
> safely apply PROC MI.
>
> Q2: Is it a good idea to even do this when modeling. Is it better to
> not consider the input if it has a lot of missing values (say more
> than 30% or 40% or whatever a lot means in the particular problem)
> or is it better to apply PROC MI. Is the model going to be more
> accurate in the sense that it reflects reality better?
>
> Q3: What are the options one should use with PROC MI. The syntax of
> this procedure allows for a lot of flexibility. Any guidelines on
> how one should use it or should just leave everything at their default
> levels?
I see that David Neal has given you some very good advice. (As always.)
Listen to him.
Now then. PROC MI *does* make assumptions about your data. So
read the directions on the back of the box before using. Are your
data continuous or discrete? Are your missing values in particular
patterns across your data table? These are really important questions
that you should think about.
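To make the "patterns" question concrete, here is a minimal sketch (in Python rather than SAS, purely for illustration) that tabulates which combinations of variables are missing together and checks whether the patterns are monotone for a given variable order. The column names and toy records are hypothetical, and the helper functions are my own, not anything from PROC MI.

```python
from collections import Counter

def missing_patterns(rows, columns):
    """Count how often each missingness pattern occurs.

    A pattern has one character per column, in the order given:
    'X' if the value is observed, '.' if it is missing (None).
    """
    counts = Counter()
    for row in rows:
        pattern = "".join("." if row.get(col) is None else "X" for col in columns)
        counts[pattern] += 1
    return counts

def is_monotone(patterns):
    """True if, in every pattern, no observed value follows a missing one."""
    return all("X" not in p[p.index("."):] for p in patterns if "." in p)

# Hypothetical toy records, for illustration only.
columns = ["age", "income", "spend"]
rows = [
    {"age": 34, "income": 52000, "spend": 310},
    {"age": 51, "income": None, "spend": None},
    {"age": 29, "income": 41000, "spend": None},
    {"age": 45, "income": None, "spend": 120},
]

patterns = missing_patterns(rows, columns)
for pattern, n in patterns.most_common():
    print(pattern, n)
print("monotone:", is_monotone(patterns))
```

The monotone check depends on the column order you pass in, so a table that fails in one order may pass in another. A pattern like "X.X" (income missing, spend observed) is what makes this toy table non-monotone.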
As (the other) David pointed out, some missing data - and some missing
records - are missing for a reason. If the data are not Missing At
Random (capitalization to indicate stat jargon), then you should not
assume that they come from the same population as the data you have.
This is *crucial*. Think about a (silly) example. You have lots and
lots of data on grocery store shopping habits. Some people have not
answered about cold medicines. Do we impute? I wouldn't. One of
the key ingredients when making meth is the over-the-counter ingredient
in decongestants and multi-product cold relief stuff. What if the
people who did not answer are criminal drug-manufacturers? Should we
assume they are like the rest of your population? I hope not.
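A small simulation shows why this matters. This is only a sketch under made-up assumptions: the population, the cutoff of 3.0, and the refusal probabilities are all invented for illustration. It compares answers dropped completely at random with answers dropped because of their own value.

```python
import random

random.seed(0)

# Hypothetical population: weekly cold-medicine purchases (arbitrary units).
population = [random.gauss(2.0, 1.0) for _ in range(10000)]

def observed_mean(values, p_missing):
    """Mean of the values that survive a missingness mechanism.

    p_missing(v) is the probability that value v goes unreported.
    """
    kept = [v for v in values if random.random() >= p_missing(v)]
    return sum(kept) / len(kept)

true_mean = sum(population) / len(population)

# Dropped completely at random: 30% of answers lost regardless of value.
mcar_mean = observed_mean(population, lambda v: 0.3)

# Not missing at random: heavy buyers (v > 3.0) refuse 90% of the time,
# everyone else refuses 10% of the time.
mnar_mean = observed_mean(population, lambda v: 0.9 if v > 3.0 else 0.1)

print(f"true mean          {true_mean:.2f}")
print(f"random-loss mean   {mcar_mean:.2f}")
print(f"value-driven mean  {mnar_mean:.2f}")
```

In the second case the respondents are systematically lighter buyers than the people who stayed silent, so anything imputed from the respondents alone inherits that bias.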
The more missing values you have to impute, the more noise and less
signal you have to work with. If you have 30% or 40% missing values
and you don't know what sort of distribution they ought to have, then
you're not going to have very good results with your multivariate
analysis. PROC MIANALYZE will (at least) tell you how much of your
noise is due to the imputation and how much is not. If all your data are
shot full of holes, you need to first start with QA of the data and
examination of the survey structure, and seeing about repairing some
of those gaping holes if you can.
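The split between imputation noise and ordinary sampling noise comes from Rubin's combining rules. Here is a minimal sketch of that arithmetic in Python. The five estimates and their variances are invented numbers, and the "share" computed here is the simple large-sample ratio, not the degrees-of-freedom-adjusted fraction of missing information.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    """
    m = len(estimates)
    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    extra = (1 + 1 / m) * between     # variance added by the imputation step
    total = within + extra
    return {
        "estimate": pooled,
        "within": within,
        "between": between,
        "total": total,
        "relative_increase": extra / within,  # r = (1 + 1/m) B / W
        "imputation_share": extra / total,    # rough share of noise due to imputation
    }

# Hypothetical results from five imputed data sets.
result = pool_estimates([1.0, 1.2, 0.9, 1.1, 1.3], [0.04] * 5)
for name, value in result.items():
    print(f"{name:18s} {value:.4f}")
```

With these made-up inputs a bit over 40% of the total variance comes from the imputation step, which is the kind of number that should send you back to look at the data before trusting the model.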
I can't give you any generic guidelines on what methodologies to use
on what variables. For me, that's a very hands-on task. You want to
model the missings as accurately as you can, even if you have to do it
on a var-by-var basis. I don't recommend using the defaults and
crossing your fingers. Still, that's better than going with single
imputation.
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician