| Date: | Wed, 17 Oct 2001 09:55:17 +0100 |
| Reply-To: | David Hitchin <D.H.Hitchin@SUSSEX.AC.UK> |
| Sender: | "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU> |
| From: | David Hitchin <D.H.Hitchin@SUSSEX.AC.UK> |
| Subject: | Re: Does PCA assume normality? |
|
| In-Reply-To: | <sbcd6e32.041@dlwcmail.dlwc.nsw.gov.au> |
| Content-Type: | text/plain; charset=us-ascii |
Christopher,
You have it just about right.
PCA is simply a rotation of the data set onto new axes, chosen so that the
first has the largest possible variance, the second has the largest
possible variance among axes at right angles to the first, and so on,
until all of the variance has been accounted for.
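To make the rotation concrete, here is a minimal sketch in Python with NumPy. The data are made up, and the names (`mixing`, `scores`, and so on) are only illustrative. It takes the eigenvectors of the covariance matrix as the new axes and checks that the component variances come out in descending order and add up to the original total variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 cases on 3 correlated variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

Xc = X - X.mean(axis=0)               # centre each variable
cov = np.cov(Xc, rowvar=False)        # 3 x 3 covariance matrix

# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                 # the data expressed on the new axes

component_var = scores.var(axis=0, ddof=1)   # variance along each new axis
total_var_original = np.trace(cov)           # sum of the original variances
total_var_components = component_var.sum()   # sum of the component variances

print(component_var)
print(total_var_original, total_var_components)
```

The eigenvector matrix is orthogonal, which is what makes this a rotation (possibly with a reflection) rather than a stretching of the data.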
If the data values are treated purely as numbers, there is no further
problem. But if you think of the data values as representing quantities
in the real world, then the numbers depend on the physical units of
measurement: do you measure the height of a person in feet, inches, or
metres? Smaller units of measurement produce bigger numbers, and
variables with bigger numbers have bigger variances, so they pull the PCA
solution towards themselves. In order to compensate for this (in my view
in a rather arbitrary way) people often standardise their variables, i.e.
they subtract the means and divide by the standard deviations. In many
packages this is the choice between doing PCA on the covariance matrix
(raw units) or on the correlation matrix (standardised variables).
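The effect of the units can be seen in a small sketch with invented height and weight data. The helper `first_axis` is a hypothetical name, not a function from any package. On the covariance matrix, changing height from metres to millimetres swings the first axis from weight to height. On standardised variables (the correlation matrix) the change of units makes no difference.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: height in metres and weight in kilograms.
height_m = rng.normal(1.70, 0.10, size=300)
weight_kg = 70 + 40 * (height_m - 1.70) + rng.normal(0, 8, size=300)

def first_axis(data, standardise):
    """Return the first principal axis (a unit vector) of `data`."""
    centred = data - data.mean(axis=0)
    if standardise:
        centred = centred / data.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    axis = eigvecs[:, np.argmax(eigvals)]
    # The sign of an eigenvector is arbitrary; make the largest loading positive.
    return axis * np.sign(axis[np.argmax(np.abs(axis))])

in_metres = np.column_stack([height_m, weight_kg])
in_millimetres = np.column_stack([height_m * 1000, weight_kg])

cov_m = first_axis(in_metres, standardise=False)        # dominated by weight
cov_mm = first_axis(in_millimetres, standardise=False)  # dominated by height
cor_m = first_axis(in_metres, standardise=True)         # unaffected by units
cor_mm = first_axis(in_millimetres, standardise=True)

print(cov_m, cov_mm)
print(cor_m, cor_mm)
```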
PCA uses ALL of the available variance, with no assumption of error
anywhere, although users often drop the smaller components after carrying
out PCA. You can therefore consider it as a purely spatial problem
rather than a statistical one.
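As a rough illustration of "all of the variance", the sketch below (made-up data, hypothetical helper `reconstruct`) rotates the data and rotates it back. With every component kept the original data are recovered exactly. With the smaller components dropped, the variance lost is exactly the sum of the dropped eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 cases, 4 variables with unequal spread.
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.3])
mean = X.mean(axis=0)
Xc = X - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def reconstruct(k):
    """Rebuild the data from the first k components only."""
    W = eigvecs[:, :k]
    return (Xc @ W) @ W.T + mean

full = reconstruct(4)      # all components: rotate and rotate back
reduced = reconstruct(2)   # the two smallest components dropped

# Variance discarded by the reduced solution, summed over the variables.
lost = (X - reduced).var(axis=0, ddof=1).sum()
share_kept = eigvals[:2].sum() / eigvals.sum()

print(np.allclose(full, X), lost, eigvals[2:].sum(), share_kept)
```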
One distinguishing characteristic of factor analysis is that the presence of
error in the data is an essential part of the model. Another is that the
underlying factors are not simple linear components of the original
variables which can be calculated exactly, but are more complicated
functions that can only be estimated - there are several different
estimation methods, each of which has some desirable properties, but no
method has all of the desirable properties at once.
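In symbols, the contrast might be written as follows. The notation is my own choice rather than anything fixed: x is the vector of observed variables, mu its mean, W the matrix of principal axes, Lambda the factor loadings, f the factors and epsilon the errors. The last expression assumes the usual textbook form in which the factors are uncorrelated with unit variance and are uncorrelated with the errors.

```latex
% PCA: component scores are exact linear functions of the observed variables
z = W^{\top}(x - \mu)

% Common factor model: observed variables are factors plus error
x = \mu + \Lambda f + \varepsilon, \qquad
\operatorname{Cov}(\varepsilon) = \Psi \ \text{(diagonal)}, \qquad
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
```

Given x, the scores z can be computed directly. The factors f cannot, because epsilon is unknown, which is why they have to be estimated.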
Generally principal component analysis is considered to be conducted on a
population; the data set that you have is all the data that you will ever
have. If you consider your data set to be a sample, then clearly you might
have drawn many other samples, and each one would have its own Principal
Component solution. In these circumstances you might want to ask how
representative your particular data set might be of the population, and
then you get into significance tests and you might want to calculate
standard errors of the estimates, a very difficult problem. Significance
tests, standard errors and confidence intervals all require some
assumptions about the distribution of the data, and generally when
assumptions are made they are of normality. Any other distributions would
be even more difficult to handle.
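The sampling point can be sketched by simulation. The population below is invented (a bivariate normal with a covariance matrix I picked), and the function names are hypothetical. Each sample gives its own first eigenvalue, and the spread of those values over repeated samples is what a standard error is trying to describe. For comparison the code also computes the usual large-sample approximation under normality, lambda * sqrt(2/n), which only applies because the simulated data really are normal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: bivariate normal with a known covariance matrix.
pop_cov = np.array([[4.0, 1.5],
                    [1.5, 1.0]])
pop_eigvals = np.linalg.eigvalsh(pop_cov)[::-1]   # descending order

def largest_eigenvalue(sample):
    """Variance of the first principal component of one sample."""
    return np.linalg.eigvalsh(np.cov(sample, rowvar=False))[-1]

def spread_over_samples(n, n_samples=500):
    """Standard deviation of the first eigenvalue across repeated samples."""
    draws = [
        largest_eigenvalue(rng.multivariate_normal([0.0, 0.0], pop_cov, size=n))
        for _ in range(n_samples)
    ]
    return np.std(draws, ddof=1)

se_small = spread_over_samples(n=30)
se_large = spread_over_samples(n=300)

# Large-sample normal-theory approximation for n = 300 (an assumption
# that holds here only because the simulated population is normal).
approx_large = pop_eigvals[0] * np.sqrt(2 / 300)

print(se_small, se_large, approx_large)
```

With real data you do not know the population, so the simulation above is not available to you, and that is where the distributional assumptions come in.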
David Hitchin
University of Sussex
|