Date: Tue, 8 Feb 2000 11:20:57 -0500
Reply-To: "J. Das" <jdas@INTEGRAINFO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "J. Das" <jdas@INTEGRAINFO.COM>
Subject: Re: Missing Values; was PROC MEANS question
In-Reply-To: <4.2.2.20000208092323.00b03640@mail.binghamton.edu>
Content-Type: text/plain; charset="iso-8859-1"
>There are a variety of methods for missing value imputation.
Here are some of the methods for imputing missing data, as discussed in
standard textbooks:
1. The imputed value can be selected from the values observed for other
units in the same sample. This is known as Hot Deck Imputation.
2. The missing values are replaced by a constant value which can be obtained
from appropriate external sources. This is called Cold Deck Imputation.
3. Missing values can be replaced by the mean calculated from the responding
units in the sample. This is called Mean Imputation.
4. Missing values can be predicted from a regression of the missing item on
items observed for the unit. This is called Regression Imputation.
5. "Multiple Imputation Methods," as discussed in a recent paper by T.E.
Raghunathan and G.D. Paulin in the American Statistical Association 1998
Proceedings of the Business and Economic Statistics Section. The references
in their paper give a list of textbooks and articles that might be helpful.
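For anyone who wants to see the mechanics, here is a minimal sketch of methods 1, 3 and 4 in Python (standard library only). The function names and the toy data are hypothetical, missing values are coded as None, and the regression uses a single fully observed predictor; treat it as an illustration under those assumptions, not as a recommended procedure.

```python
import random
from statistics import mean

def mean_impute(values):
    """Method 3: replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def hot_deck_impute(values, rng):
    """Method 1: replace each missing value with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def regression_impute(x, y):
    """Method 4: replace missing y with its least-squares prediction from x.

    Assumes x is fully observed and the fit uses only the complete pairs.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical toy data: the second unit did not report income.
age = [1.0, 2.0, 3.0, 4.0]
income = [10.0, None, 30.0, 40.0]

print(mean_impute(income))
print(hot_deck_impute(income, random.Random(0)))
print(regression_impute(age, income))
```

Note that the three methods fill the same gap with different values, which is one reason the choice of method deserves thought.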
Jayanta
---------------------------------
Dr. Jayanta Das
Senior Econometrician
Integra Information, Inc.
Flanders, NJ 07828
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Lary Jones
Sent: Tuesday, February 08, 2000 9:57 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Missing Values; was PROC MEANS question
At 08:16 AM 2/8/2000 -0500, Miller, Scott wrote:
>one thing you can do that is fairly acceptable where there would be some
>missing responses to questions in a series that constitutes a 'scale score'
>is to substitute the missing response with the mean of the other responses,
>as long as the subject answers more than .5 of the series. if the subject
>failed to respond to at least .5 of the questions, the scale should be
>unscored (score=.)
This is an accepted approach for many (though I question 50%; I would set a
much lower proportion). Nevertheless, I am hesitant to recommend any kind
of missing value replacement without knowing the data.
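To make the quoted rule concrete, here is a rough Python sketch. The function name scale_score and the min_answered parameter are my own labels, not anything from the thread; missing responses are coded as None, and I have read the rule as "strictly more than half answered." Since the cutoff itself is in dispute, it is left as a parameter.

```python
def scale_score(responses, min_answered=0.5):
    """Sum the items, replacing each missing item with the respondent's own mean.

    Returns None (unscored) unless strictly more than `min_answered`
    of the items were answered.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) <= min_answered * len(responses):
        return None
    person_mean = sum(answered) / len(answered)
    return sum(person_mean if r is None else r for r in responses)

# Hypothetical four-item scale.
print(scale_score([4, None, 2, 3]))     # three of four answered: scored
print(scale_score([4, None, None, 3]))  # only half answered: unscored
print(scale_score([4, None, 2, 3], min_answered=0.8))  # stricter cutoff
```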
First of all, it is often forgotten that deciding to replace missing values
depends on the assumption that the missing values occur randomly. I think
this rarely happens. It is easily understood that items which are of lower
quality (confusing, difficult to answer) and which deal with
acknowledgement of undesirable qualities will have more missing
answers. One often focuses on the relation of an item to others in the
"scale." It is important to consider the number of missing values, across
respondents, as well. I do not have in hand an exclusion rule, but I would
be very uncomfortable with any item which is missing for more than 10% of the
respondents. A noticeable collection of missing values for an item raises
questions of reliability, if not validity.
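Checking missingness by item, across respondents, takes only a few lines. The sketch below is in Python with hypothetical function names and made-up data; the 10% cutoff is only my own comfort level from the paragraph above, not an established rule, so it is a parameter.

```python
def item_missing_rates(rows):
    """rows: one list of item responses per respondent, with None for missing.

    Returns the proportion of respondents missing each item.
    """
    n = len(rows)
    n_items = len(rows[0])
    return [sum(row[j] is None for row in rows) / n for j in range(n_items)]

def flag_items(rows, max_rate=0.10):
    """Return the positions of items missing for more than `max_rate` of respondents."""
    rates = item_missing_rates(rows)
    return [j for j, rate in enumerate(rates) if rate > max_rate]

# Hypothetical data: 10 respondents, 3 items.
rows = [[1, 2, 3] for _ in range(10)]
rows[0][1] = None   # item 1 missing for two respondents
rows[1][1] = None
rows[2][2] = None   # item 2 missing for one respondent

print(item_missing_rates(rows))
print(flag_items(rows))
```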
There are a variety of methods for missing value imputation. I think this
is a case where the techniques may be outstripping our general knowledge
about appropriateness. We can devise a number of techniques which preserve
properties of a distribution. The real question is whether we are applying
these techniques without thinking about the meaning of the data. Is it
better to use the sum of items with means replacing the missing values, or
to use the mean ignoring missing values? How many items in a scale do we
allow to be missing? How many missing values for an item is still "ok"?
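On the first of these questions, a short derivation helps, at least for one reading of it. Assume (my notation, not from any reference) a scale of k items, of which the respondent answered the m items in the set A, and assume the replacement value is the respondent's own mean over the answered items:

```latex
% k = items in the scale, A = set of answered items, m = |A| (assumed notation)
\bar{x}_A = \frac{1}{m} \sum_{i \in A} x_i
\qquad
S = \sum_{i \in A} x_i + (k - m)\,\bar{x}_A
  = m\,\bar{x}_A + (k - m)\,\bar{x}_A
  = k\,\bar{x}_A
```

Under that assumption the sum with means replacing the missing values is just k times the mean ignoring missing values, so the two scores carry the same information. The two diverge only if the replacement is something else, such as the item's mean across respondents.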
Being in the computing services game for the last 25 years, my knowledge of
the literature is limited. I welcome the comments of others on this issue.
-lary jones
_______________________________________________________
Lary Jones % Statistical Computing Analyst
Computing Services % ..........................
Binghamton University % LJones@Binghamton.EDU
Binghamton, NY 13902-6000 % (607) 777-2614