LISTSERV at the University of Georgia
Date:         Tue, 22 Feb 2005 17:41:28 -0800
Reply-To:     cassell.david@EPAMAIL.EPA.GOV
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject:      Re: Multiple Imputation--Missing Data
In-Reply-To:  <20050222232726.3CB724BDAA@ws1-1.us4.outblaze.com>
Content-type: text/plain; charset=US-ASCII

"Nick ." <ni14@MAIL.COM> wrote:
> I would like your thoughts on Multiple Imputation.
>
> Q1: Does your data have to satisfy any assumptions in order to
> safely apply PROC MI?
>
> Q2: Is it a good idea to even do this when modeling. Is it better to
> not consider the input if it has a lot of missing values (say more
> than 30% or 40% or whatever a lot means in the particular problem)
> or is it better to apply PROC MI. Is the model going to be more
> accurate in the sense that it reflects reality better?
>
> Q3: What are the options one should use with PROC MI. The syntax of
> this procedure allows for a lot of flexibility. Any guidelines on
> how one should use it or should just leave everything at their default levels?

I see that David Neal has given you some very good advice. (As always.) Listen to him.

Now then. PROC MI *does* make assumptions about your data. So read the directions on the back of the box before using. Are your data continuous or discrete? Are your missing values in particular patterns across your data table? These are really important questions that you should think about.
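
A quick way to look at the pattern question is to tabulate which combinations of variables are missing together. What follows is only a minimal sketch in plain Python, not PROC MI output. The function name missing_patterns, the variable names, and the toy records are all invented for illustration, and None is assumed to stand for a missing value.

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Count how often each missingness pattern occurs.

    rows: list of dicts; a value of None (or an absent key) means missing.
    Returns a Counter keyed by a string such as "X.X", where X marks an
    observed value and . marks a missing one, in the order of `variables`.
    """
    patterns = Counter()
    for row in rows:
        key = "".join("." if row.get(v) is None else "X" for v in variables)
        patterns[key] += 1
    return patterns

# Hypothetical toy records, invented for illustration only.
rows = [
    {"age": 34, "income": 52000, "cold_meds": 1},
    {"age": 51, "income": None, "cold_meds": 0},
    {"age": 29, "income": 41000, "cold_meds": None},
    {"age": 62, "income": None, "cold_meds": None},
    {"age": 45, "income": 38000, "cold_meds": None},
]

variables = ["age", "income", "cold_meds"]
for pattern, count in missing_patterns(rows, variables).most_common():
    print(pattern, count)
```

If one pattern dominates, or the holes in one variable always come with holes in another, that is worth knowing before choosing an imputation method.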

As (the other) David pointed out, some missing data - and some missing records - are missing for a reason. If the data are not Missing At Random (capitalization to indicate stat jargon), then you should not assume that they come from the same population as the data you have. This is *crucial*. Think about a (silly) example. You have lots and lots of data on grocery store shopping habits. Some people have not answered about cold medicines. Do we impute? I wouldn't. One of the key ingredients when making meth is the over-the-counter ingredient in decongestants and multi-product cold relief stuff. What if the people who did not answer are criminal drug-manufacturers? Should we assume they are like the rest of your population? I hope not.
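
The worry about data that are not Missing At Random can be seen in a small simulation. This is a sketch under made-up assumptions: a normal population with mean 50 and standard deviation 10, and response probabilities (0.6, 0.9, 0.3) picked purely for illustration. In the first case everyone answers with the same probability. In the second, people with high values answer much less often, so the respondents no longer look like the population.

```python
import random
from statistics import mean

def simulate(n=10000, seed=0):
    """Compare respondent means under two hypothetical nonresponse mechanisms."""
    rng = random.Random(seed)
    population = [rng.gauss(50, 10) for _ in range(n)]

    # Missingness unrelated to the value: everyone answers with probability 0.6.
    unrelated = [x for x in population if rng.random() < 0.6]

    # Missingness depends on the value itself: values of 50 or more
    # are reported only 30% of the time, lower values 90% of the time.
    related = [x for x in population
               if rng.random() < (0.9 if x < 50 else 0.3)]

    return mean(population), mean(unrelated), mean(related)

true_mean, unrelated_mean, related_mean = simulate()
print(f"mean of all values:                     {true_mean:.2f}")
print(f"respondents, missingness unrelated:     {unrelated_mean:.2f}")
print(f"respondents, missingness tied to value: {related_mean:.2f}")
```

Any imputation built only from the second group of respondents would inherit their downward shift, which is the reason to ask why values are missing before imputing them.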

The more missing values you have to impute, the more noise and less signal you have to work with. If you have 30% or 40% missing values and you don't know what sort of distribution they ought to have, then you're not going to have very good results with your multivariate analysis. PROC MIANALYZE will (at least) tell you how much of your noise is due to the imputation and how much is not. If all your data are shot full of holes, you need to first start with QA of the data and examination of the survey structure, and seeing about repairing some of those gaping holes if you can.
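
The split between imputation noise and ordinary sampling noise is usually computed with Rubin's combining rules. Here is a minimal sketch of that arithmetic, assuming you already have one point estimate and one standard error from each imputed data set. The function name pool_estimates and the example numbers are hypothetical, and this is not a reproduction of PROC MIANALYZE output.

```python
from statistics import mean, variance

def pool_estimates(estimates, std_errors):
    """Combine per-imputation results with Rubin's rules.

    estimates:  point estimates, one per imputed data set (at least two)
    std_errors: matching standard errors
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # spread across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    # Share of the total variance that comes from the imputation step.
    imputation_share = (1 + 1 / m) * between / total
    return {
        "estimate": pooled,
        "within_variance": within,
        "between_variance": between,
        "total_variance": total,
        "imputation_share": imputation_share,
    }

# Hypothetical results from three imputed data sets.
result = pool_estimates([10.0, 12.0, 11.0], [2.0, 2.0, 2.0])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The more the estimates disagree from one imputed data set to the next, the larger the between-imputation term, and the larger the share of your uncertainty that comes from having had to impute at all.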

I can't give you any generic guidelines on what methodologies to use on what variables. For me, that's a very hands-on task. You want to model the missings as accurately as you can, even if you have to do it on a var-by-var basis. I don't recommend using the defaults and crossing your fingers. Still, that's better than going with single imputation.

David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician

