Date: Wed, 17 Oct 2001 09:55:17 +0100
From: David Hitchin
Sender: "SPSSX(r) Discussion"
Subject: Re: Does PCA assume normality?
To: Chris Howden

Christopher,

You have it just about right.

PCA is simply the rotation of the data set to align it along new axes such that the first has the largest possible variance, the second has the next largest possible variance, and so on, until all of the variance has been accounted for.
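
As a minimal sketch of this rotation, assuming Python with NumPy and a made-up data set: the helper name `pca_rotate` and the synthetic data are purely illustrative, and this is not how SPSS or any particular package implements it.

```python
import numpy as np

def pca_rotate(data):
    """Rotate centred data onto its principal axes (illustrative helper)."""
    centred = data - data.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centred @ eigvecs                 # the data expressed on the new axes
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 cases, 3 variables (an assumption for the demo)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
data = rng.normal(size=(200, 3)) @ mixing

eigvals, eigvecs, scores = pca_rotate(data)
total_variance = np.trace(np.cov(data, rowvar=False))
print("variance along each new axis:", eigvals)
print("sum of those variances:", eigvals.sum(), "total variance:", total_variance)
```

The variances along the new axes come out in decreasing order and add up to the total variance of the original variables, which is the sense in which all of the variance is accounted for.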

If the data values are treated purely as numbers, there is no further problem. But if you think of them as representing quantities in the real world, then the numbers depend on the physical units of measurement: do you measure the height of a person in feet, inches, or metres? Smaller units of measurement produce bigger numbers, and therefore bigger variances, so they pull the PCA solution towards the variables with the larger numbers. In order to compensate for this (in my view in a rather arbitrary way) people often standardise their variables, i.e. they subtract the means and divide by the standard deviations. In many packages this is the choice between doing PCA on the covariance matrix or the correlation matrix.
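
The effect of units can be sketched as follows, again assuming NumPy. The heights and weights are invented, the helper `first_axis` is hypothetical, and millimetres are used instead of feet or inches only because they make the shift easy to see.

```python
import numpy as np

def first_axis(matrix):
    """Eigenvector belonging to the largest eigenvalue (illustrative helper)."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(1)
height_m = rng.normal(1.70, 0.10, size=500)                      # metres
weight_kg = 70 + 40 * (height_m - 1.70) + rng.normal(0, 8, size=500)

in_metres = np.column_stack([height_m, weight_kg])
in_mm = np.column_stack([height_m * 1000, weight_kg])            # same heights in mm

# PCA on the covariance matrix: the first axis follows the larger numbers
cov_axis_m = first_axis(np.cov(in_metres, rowvar=False))
cov_axis_mm = first_axis(np.cov(in_mm, rowvar=False))

# PCA on the correlation matrix: the choice of units makes no difference
corr_m = np.corrcoef(in_metres, rowvar=False)
corr_mm = np.corrcoef(in_mm, rowvar=False)

print("first axis, height in m :", cov_axis_m)
print("first axis, height in mm:", cov_axis_mm)
print("correlation matrices equal:", np.allclose(corr_m, corr_mm))
```

With height in metres the first covariance-based axis is almost entirely weight; with height in millimetres it is almost entirely height. The correlation matrix is identical in both cases, so the standardised solution does not move.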

PCA uses ALL of the available variance, with no assumption of error anywhere, although users often drop the smaller components after carrying out PCA. You can therefore consider it as a purely spatial problem rather than a statistical one.
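
A small sketch of what "no error anywhere" means in practice, assuming NumPy and synthetic data (the choice to keep `k = 2` components is arbitrary): with every component kept, the rotation can be reversed exactly, and whatever is lost by dropping components is precisely the variance of the components dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 100 cases, 4 correlated variables (an assumption for the demo)
data = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
mean = data.mean(axis=0)
centred = data - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigvals)[::-1]              # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = centred @ eigvecs

# Keeping every component reverses the rotation exactly: there is no error term.
rebuilt_all = scores @ eigvecs.T + mean

# Dropping the smaller components is a decision taken afterwards by the user.
k = 2
rebuilt_k = scores[:, :k] @ eigvecs[:, :k].T + mean
lost_variance = np.sum((data - rebuilt_k) ** 2) / (len(data) - 1)

print("exact reconstruction:", np.allclose(rebuilt_all, data))
print("variance lost:", lost_variance, "dropped eigenvalues:", eigvals[k:].sum())
```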

One distinguishing characteristic of factor analysis is that the presence of error in the data is an essential part of the model. Another is that the underlying factors are not simple linear combinations of the original variables which can be calculated exactly, but are more complicated functions that can only be estimated. There are several different estimation methods, each of which has some desirable properties, but no method has all of the desirable properties at once.

Generally, principal component analysis is treated as if it were conducted on a population: the data set that you have is all the data that you will ever have. If you consider your data set to be a sample, then clearly you might have drawn many other samples, and each one would have its own principal component solution. In these circumstances you might want to ask how representative your particular data set is of the population. Then you get into significance tests, and you might want to calculate standard errors of the estimates, which is a very difficult problem. Significance tests, standard errors and confidence intervals all require some assumptions about the distribution of the data, and generally when assumptions are made they are of normality. Any other distribution would be even more difficult to handle.
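
To illustrate the sampling point, here is a rough simulation, assuming NumPy. The population covariance matrix, the sample size of 50 and the helper `largest_eigenvalue` are all invented for the demonstration, and normality is assumed only because it makes the population easy to sample from. It is not a significance test, just a look at how much the first component's variance moves from one sample to the next.

```python
import numpy as np

def largest_eigenvalue(sample):
    """Largest eigenvalue of the sample covariance matrix (illustrative helper)."""
    # eigvalsh returns eigenvalues in ascending order, so the last is the largest
    return np.linalg.eigvalsh(np.cov(sample, rowvar=False))[-1]

rng = np.random.default_rng(3)
# Hypothetical population covariance for two variables
population_cov = np.array([[4.0, 1.5],
                           [1.5, 1.0]])
true_value = np.linalg.eigvalsh(population_cov)[-1]

# Draw many samples of 50 cases; each one gives its own PCA solution
estimates = np.array([
    largest_eigenvalue(rng.multivariate_normal([0.0, 0.0], population_cov, size=50))
    for _ in range(1000)
])
spread = estimates.std(ddof=1)   # empirical standard error across the samples

print("population value:", true_value)
print("mean of sample estimates:", estimates.mean())
print("standard deviation of sample estimates:", spread)
```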

David Hitchin
University of Sussex
