Date: Fri, 10 Jul 2009 16:03:45 -0300
Reply-To: Hector Maletta <hmaletta@fibertel.com.ar>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Hector Maletta <hmaletta@fibertel.com.ar>
Subject: Re: Trying to Do Principal Components Analysis With Lots of
Pairwise Missing Data
In-Reply-To: <596601.88316.qm@web110412.mail.gq1.yahoo.com>
Zachary,
The fact that your correlation matrix is not positive definite is a
completely different problem from the numerous missing values your dataset
contains. (Of course, if you had no missing values at all, the correlation
matrix would perhaps have different values in its cells, and then it might
be positive definite, but that’s just hypothetical: it may well happen that
even with no missing data the matrix still fails to be positive definite.)
A square symmetric n x n matrix A is positive definite if xAx’ > 0 for every
nonzero x, where x is a row vector of n real numbers and x’ is its (column
vector) transpose. This is normally the case for correlation matrices, but
it may fail in particular cases of singularity or collinearity or some other
quirk. Such a failure might conceivably arise even in the absence of missing
values.
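As a rough illustration of the definition, here is a minimal Python sketch
that tests positive definiteness by attempting a Cholesky factorization,
which succeeds only for positive definite matrices. The function name and
both example matrices are invented for illustration; they are not taken from
your data.

```python
import math

def is_positive_definite(a, tol=1e-12):
    """Return True if the symmetric matrix `a` (list of lists) has a Cholesky factor."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= tol:
                    # A non-positive pivot means some nonzero x gives xAx' <= 0.
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return True

# Hypothetical matrices: the second has unit diagonal and entries in [-1, 1],
# yet its three correlations are mutually inconsistent.
consistent = [[1.0, 0.5], [0.5, 1.0]]
inconsistent = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]

print(is_positive_definite(consistent))
print(is_positive_definite(inconsistent))
```

The second matrix looks like a correlation matrix cell by cell, but no set of
three real variables could produce those three correlations at once.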
Now, leaving this problem aside, the large amount of missing data in your
dataset would almost certainly preclude any useful attempt to perform PCA,
unless you are ready to adopt heroic assumptions and to engage in no less
audacious procedures, some of which you suggest.
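To put a number on "large", here is a small arithmetic sketch in Python. It
assumes (my assumption, not something stated in your message) that every
respondent gets a uniformly random subset of exactly 50 of the 150 items;
the function name is hypothetical.

```python
from fractions import Fraction

n_items = 150     # attributes in the survey, from the original message
n_answered = 50   # items each respondent answers (about 1/3)

# Probability that one respondent answered BOTH items of a given pair,
# assuming a uniformly random subset of exactly 50 of the 150 items.
pair_share = Fraction(n_answered * (n_answered - 1), n_items * (n_items - 1))

def expected_pairwise_n(n_respondents):
    """Expected number of cases behind a single pairwise correlation."""
    return float(pair_share * n_respondents)

print(float(pair_share))
print(expected_pairwise_n(1000))
```

Under that assumption only about 11% of respondents contribute to any given
cell of the correlation matrix, and each cell rests on a different subset of
respondents.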
Replacing missing data with the grand mean is not advisable. There are much
better ways, such as those included in the SPSS Missing Values module, e.g.
assigning values for a missing variable based on a regression that predicts
that variable from other related variables. Your problem is that in most
cases for which variable X is missing, the attempt to predict X as a
function of other variables U, V, W, …, Z may fail because one or more of
those predictors would probably also be missing. You may have to hunt around
for the best set of non-missing predictors to predict each particular
missing value, but this may lead to inconsistencies: you would use some
predictors to predict the missing AGE of John, and another set of predictors
for the AGE of Mary, depending on which predictors are missing for John and
which for Mary (and for each of your other subjects in the sample). Your
message does not say how many cases are in your data set, but this may be a
long endeavor involving thousands of individual missing cells, each to be
predicted by a different equation. I do not know whether any Hot Deck
software can automate this process, but I do have doubts about its
reliability if it exists.
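The John and Mary situation can be sketched in a few lines of Python. This is
a deliberately simplified illustration, not what the SPSS Missing Values
module does: it uses a single predictor per case (the observed one most
strongly correlated with the target), and the variable names, helper
functions and toy data are all hypothetical.

```python
def pairs(data, x, y):
    """(x, y) values for the cases where both variables are observed."""
    return [(row[x], row[y]) for row in data
            if row.get(x) is not None and row.get(y) is not None]

def fit_line(points):
    """Least-squares (slope, intercept, r) of y on x; None if not estimable."""
    n = len(points)
    if n < 3:
        return None
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    if sxx == 0 or syy == 0:
        return None
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5

def impute(data, target, predictors):
    """Fill missing `target` cells in place; return {case index: predictor used}."""
    # Fit every candidate equation once, on observed values only.
    fits = {p: fit_line(pairs(data, p, target)) for p in predictors}
    used = {}
    for i, row in enumerate(data):
        if row.get(target) is not None:
            continue
        usable = [p for p in predictors
                  if row.get(p) is not None and fits[p] is not None]
        if not usable:
            continue  # nothing to predict from; the cell stays missing
        best = max(usable, key=lambda p: abs(fits[p][2]))
        slope, intercept, _ = fits[best]
        row[target] = intercept + slope * row[best]
        used[i] = best
    return used

# Toy data: case 4 plays "John", case 5 plays "Mary".
data = [
    {"age": 30, "income": 40, "tenure": 5},
    {"age": 40, "income": 60, "tenure": 15},
    {"age": 50, "income": 80, "tenure": 25},
    {"age": 60, "income": 100, "tenure": None},
    {"age": None, "income": 70, "tenure": None},
    {"age": None, "income": None, "tenure": 10},
]
used = impute(data, "age", ["income", "tenure"])
print(used)
```

The returned dictionary shows the inconsistency in miniature: John's age is
predicted from income and Mary's from tenure, so the two imputed values come
from different equations.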
If “Each respondent … randomly answers about 1/3 of the questions”, then
probably the questions are to some extent interchangeable. John answered
some questions and Mary answered others, but if the survey design let them
randomly answer about 1/3 of the questions, it would seem that the questions
answered by each person are more or less interchangeable, any with any, or
some with some. Are there subsets of questions, such as a set of questions
about one subject matter and another subset about another? Some questions
about how the product smells, some about its status meaning, some about its
health properties, and so on? You may want to treat attributes belonging to
the same “family” of attributes as equivalent or interchangeable, thus
greatly simplifying your work: Mary answered four “smell” questions, no
matter which ones specifically, and so you have valid values for Mary in
four smell variables, even if she only answered four of the 12 smell
questions available.
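One way to act on the "family" idea is sketched below in Python. Averaging
the answered items into a single score per family is my own simplification
of the suggestion above (the alternative is to keep the answers as
interchangeable slots); the family names, item names and ratings are
hypothetical.

```python
def family_scores(answers, families):
    """Average a respondent's answered items within each family.

    answers:  {item name: rating}, holding only the items actually answered.
    families: {family name: list of item names belonging to that family}.
    """
    scores = {}
    for family, items in families.items():
        values = [answers[item] for item in items if item in answers]
        # A family with no answered items stays missing for this respondent.
        scores[family] = sum(values) / len(values) if values else None
    return scores

families = {
    "smell": ["smell_1", "smell_2", "smell_3"],
    "health": ["health_1", "health_2"],
}
mary = {"smell_1": 4, "smell_3": 2, "health_2": 5}
john = {"smell_2": 5}

print(family_scores(mary, families))
print(family_scores(john, families))
```

Mary and John answered different smell items, yet both end up with a valid
smell score, which is what shrinks the missing-data problem from 150 sparse
variables to a handful of mostly complete ones.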
Hope some of this rambling answer helps.
Hector
_____
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Zachary Feinstein
Sent: 10 July 2009 13:42
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Trying to Do Principal Components Analysis With Lots of Pairwise
Missing Data
I have a situation where there are a total of 150 attributes. Each
respondent to my survey randomly answers about 1/3 of the questions, so they
each get about 50 questions.
I wish to do a Factor Analysis/PCA on the data but clearly that much missing
data is a problem. I can get a correlation matrix and try to run the PCA
off of that. But the PCA will not work because the matrix is not "positive
definite." I figured changing all of the counts to something constant in my
correlation "matrix" data file (the one with the ROWTYPE_ and other such
variables) would trick SPSS into not seeing all of the pairwise missing data
but I still get the same error message.
So yes I am trying to trick SPSS into not viewing the plethora of missing
data. Below are some ideas. I would love any and all feedback on my ideas
as well as some other ideas:
1. Mean-sub the data like crazy. This means 2/3 of the data will be
based on mean-subbed data. I figure mean-sub by the variable and by the
person average too.
2. Somehow add random noise to either the raw data or the correlation
matrix. Not entirely sure what this would accomplish besides getting rid of
some linear dependencies.
3. Seek out the linear dependencies and maybe drop a few variables (or
randomly adjust them). I have a MANOVA command I did this with but I think
that MANOVA does not want missing data. Correct me if I am wrong.
4. Bootstrapping. But this will take a long time to bootstrap missing
raw data.
5. Hot-Deck Imputation. Have heard a bit about this but do not know much
about it.
6. Missing-Value module in SPSS.
7. Amelia module that I used many years ago. I did not like the
missing-value imputation that it did.
Yes, I recognize that we are replacing structurally missing data almost as
if it were randomly missing. But surely there must be a way. I know that it
is not kosher to run PCA with so much missing data but I need to figure
something out. I am very interested in your feedback. Thank you.
Zachary
zsfeinstein@yahoo.com
(651) 698-2184