Date: Sun, 10 Sep 2006 22:19:49 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Interesting missing data and estimation problem (was Macro for
regression outputs and scoring)
Content-Type: text/plain; format=flowed
> David L Cassell <davidlcassell@MSN.COM> wrote on 8/31/2006 1:02 AM:
> >The point of multiple imputation is to provide a way to see how much
> >of the noise in your model is from your data, and how much is from
> >your imputation process. If you only compute single imputation, you
> >are doomed to losing this information. You can't ever know how
> >good your guesses are if you only get a single estimate. And that
> >estimate will have a different standard error than your regular data,
> >although you will end up ignoring that. So how much damage are
> >you doing to your estimation process by avoiding the issue?
> >You only need a handful of replicates, unless you have a huge
> >proportion of missings in your data.
> >Have I convinced you yet? :-)
>I think MI is a great solution to most missing data problems, but not
>for mine. Either I am missing something, or I haven't been clear enough
>in what we are trying to do. So, I'll attempt to clarify, and see what
>you (and any others who are following along) think. The combination of
>PROC REG and PROC SCORE solves the problem I thought we had, but perhaps
>we have more problems than we knew about.....
>Our goal is to estimate the number of drug injectors in each of 95
>cities for each of 11 years. We have four estimates of this number.
>Each of these is biased (as you can imagine, it's a hard thing to
>measure) and three of the series have some missing data. All sources
>are 'noisy' in that there are some numbers that are clearly far off.
>So the plan we came up with is as follows:
>1) Examine the data for numbers that are clearly wrong (e.g., if the
>estimate for one year is much, much smaller or much, much larger than the
>years before and after, or if an estimate is so low that it cannot be
>correct).
>2) Set these to missing.
>3) When a source is missing for a particular city and year, do a single
>imputation based on a regression for the same year for the other cities
>(e.g., suppose we are missing the 1998 estimate for Pittsburgh for data
>source A. We would then use the data on the other 94 cities to run a
>regression of A on B, C, and D, and use that regression on Pittsburgh).
>4) Combine the four sources
>5) Smooth the data, probably with Loess.
>We also contemplate reversing the order of steps 4 and 5.
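[Interjecting here: your step 3, as I read it, is just a regression fit on
the 94 complete cities followed by scoring the fitted equation on the
missing one -- which is exactly what your PROC REG + PROC SCORE pair does.
A sketch of the same idea outside SAS, in Python with invented stand-in
numbers (the series names and the linear relationship are made up purely
for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in data: four estimate series (A, B, C, D) for 95
# cities in a single year.  The relationship among them is made up
# purely for illustration.
n_cities = 95
B = rng.lognormal(8, 0.5, n_cities)
C = B * rng.lognormal(0, 0.2, n_cities)
D = B * rng.lognormal(0, 0.3, n_cities)
A = 0.9 * B + 0.2 * C - 0.1 * D + rng.normal(0, 200, n_cities)

# Say source A is missing for city 0 ("Pittsburgh").
missing = 0
obs = np.arange(n_cities) != missing

# Regress A on B, C, and D over the 94 observed cities...
X = np.column_stack([np.ones(n_cities), B, C, D])
beta, *_ = np.linalg.lstsq(X[obs], A[obs], rcond=None)

# ...then score the fitted equation on the missing city.
A_hat = float(X[missing] @ beta)
```

That gives you one deterministic fill per hole, which is where my worry
below comes in.]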
>Given this plan, I don't see the value of MI. The multiple imputations
>will just get averaged out in step 4, won't they?
>Thanks as always for all comments and thoughts.
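The point estimates do average out -- but the spread among them *before*
they average out is exactly the information you want to keep. Rubin's
combining rules make that concrete; here's the arithmetic in Python with
invented numbers (in SAS, PROC MIANALYZE does this bookkeeping for you):

```python
import numpy as np

# Suppose the whole impute-then-estimate pipeline was run m = 5 times,
# each run giving a point estimate q[i] with estimated variance u[i].
# (All numbers invented for illustration.)
q = np.array([5200.0, 4900.0, 5100.0, 5350.0, 4950.0])
u = np.array([150.0 ** 2] * 5)
m = len(q)

q_bar = q.mean()            # combined estimate -- yes, the averaging happens
W = u.mean()                # within-imputation variance
B = q.var(ddof=1)           # between-imputation variance (what you'd lose)
T = W + (1 + 1 / m) * B     # Rubin's total variance
```

A single imputation gives you something close to q_bar with a reported
variance near W, and silently throws away the (1 + 1/m)*B piece.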
However you do it, you have to bear in mind that the underlying
regression assumptions are important. If you do enough non-OLS
stuff, you can sort of avoid the identically-distributed part, but you
will still have the major issue of not knowing how much of your noise
is due to the data, and how much is due to your imputation
process. MI is the only way to get around that aspect of the problem.
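To see where that hidden noise lives, compare a deterministic regression
fill with proper draws from the predictive distribution. A toy Python
sketch with invented data (PROC MI goes further and draws the regression
parameters themselves, but the idea is the same):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy illustration with invented data: one missing y, imputed from a
# regression on x.
x = rng.normal(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 1.0, 50)

slope, intercept = np.polyfit(x, y, 1)
resid_sd = np.std(y - (slope * x + intercept), ddof=2)

x_miss = 0.7
single = slope * x_miss + intercept      # deterministic single imputation

# Proper imputation adds residual noise, so the m fills disagree with
# each other -- that disagreement is the imputation noise a single
# deterministic fill hides.
m = 20
draws = single + rng.normal(0.0, resid_sd, m)
between_sd = draws.std(ddof=1)
```

Run the deterministic version twenty times and you get the same number
twenty times; the zero spread isn't certainty, it's blindness.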
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330