Date: Tue, 5 Dec 2006 11:35:59 +1100
ReplyTo: paulandpen@optusnet.com.au
Sender: "SPSSX(r) Discussion" <SPSSXL@LISTSERV.UGA.EDU>
From: Paul Dickson <paulandpen@optusnet.com.au>
Subject: Re: Kmeans vs. hierarchical clustering
ContentType: text/plain
Alina,
There is no right or wrong approach here; what you need is a well-thought-out, logical rationale for one choice over another, plus some basic investigation of your data to explain what is happening. From your outline so far, what I know is that you have ten variables and 64 cases. Typically, I would defer to hierarchical clustering (HC) given your sample size, since that is the only piece of information you have provided in your posting, apart from the disparate findings across the two algorithms (HC and kmeans). I have read somewhere (I cannot remember where) that HC produces more stable solutions than kmeans with small sample sizes. You may be able to find some published peer-reviewed literature to substantiate your choice of one algorithm over the other, based in part (not the only consideration!) on your sample size. That does not seem to hold in your case (kmeans gives seemingly more balanced solutions), so here are some other things to look at, using SPSS.
1. Multicollinearity (is this stuffing up your solutions?)
Before you run your clustering process again, I would first run a correlation analysis on your variables and build a correlation matrix to assess collinearity between them. Variables that are highly collinear (i.e. have high correlations with one another) should be omitted from the analysis unless there are theoretical grounds for keeping them in. You could also run a quick and dirty PCA on your variables (before I get shot down for running PCA on 64 cases: you are doing this just to see which items load together, looking for general patterns, and not reading too much into the PCA results). Then run and rerun your cluster analyses: develop different solutions (HC and kmeans) with all the variables included, then eliminate any collinear variables and rerun both, and see whether this has an impact on the differences between the two solutions. That way you can identify or discount multicollinearity as a factor affecting your solutions.
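For anyone who wants to try the correlation screen outside SPSS, here is a minimal Python sketch. The variable names, the toy values and the 0.8 cutoff are all made up for illustration; none of them come from Alina's data, and the cutoff is a judgment call rather than a rule.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers.

    Assumes neither variable is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps a variable name to its list of values (one per case).
    The 0.8 default is an illustrative assumption, not a rule from the post.
    """
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Made-up toy data: v2 is roughly twice v1, v3 is unrelated to both.
data = {
    "v1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "v2": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "v3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
print(collinear_pairs(data))
```

Any pair that gets flagged is a candidate for dropping one of its two variables before rerunning both cluster solutions, unless theory says to keep both.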
2. Are your clusters an artefact of the algorithm rather than 'true' clusters, which could explain the disparate results across the two algorithms you used? Given the way the SPSS clustering procedures work, and their shortcomings, here is a little test to run on your solutions.
Depending on how you sort the file and the order of cases, your solutions can vary (oh dear!). Here is what I would do if you have time; it is a quick way to test cluster 'reproducibility'. Generate a set of random id variables at the end of your data set (each one assigning a different random id number to each case). Sort your dataset by each of these id variables in turn (ascending and descending) and rerun your cluster analyses each time. Save the cluster memberships and then run crosstabs on the different memberships. If your clusters are stable, you should see similar membership patterns across the differently sorted solutions, no matter how you sort the dataset. If you do not, you have a clue that the algorithm is not picking up real and reproducible solutions!
(In ClustanGraphics, I can seed 5000 solutions for kmeans, and it generates a reproducibility index based on the Euclidean sum of squares that tells me, for example, that across different random starting points my solution is reproduced 75% of the time.)
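The reorder-and-crosstab test can be sketched in Python as follows. This is only an illustration under stated assumptions: the kmeans here is a bare-bones version that takes the first k cases in file order as starting centres (a simplification chosen to make the result order-dependent, not a copy of the SPSS routine), and the six cases are made-up toy data.

```python
import random
from collections import Counter


def kmeans_first_k(rows, k, max_iter=100):
    """Plain kmeans whose initial centres are the first k rows.

    Starting from the first k rows makes the outcome depend on case order,
    which is the property being tested. Not the SPSS algorithm.
    """
    centres = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centres[c])))
            for row in rows
        ]
        if new_labels == labels:  # memberships stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def membership_by_order(cases, k, seed):
    """Put the cases in a random order, cluster, return {case_id: label}."""
    rng = random.Random(seed)
    ids = list(cases)
    rng.shuffle(ids)
    labels = kmeans_first_k([cases[i] for i in ids], k)
    return dict(zip(ids, labels))


def crosstab(run_a, run_b):
    """Count cases for each (label in run A, label in run B) combination."""
    return Counter((run_a[i], run_b[i]) for i in run_a)


# Made-up toy cases: two clearly separated groups of three.
cases = {1: (0, 0), 2: (1, 1), 3: (2, 2), 4: (10, 10), 5: (11, 11), 6: (12, 12)}

baseline = membership_by_order(cases, k=2, seed=0)
for seed in range(1, 4):
    print(seed, dict(crosstab(baseline, membership_by_order(cases, 2, seed))))
```

Cluster numbers are arbitrary labels, so a stable solution shows up as one well-filled cell per row of the crosstab, not necessarily on the diagonal. Cases smeared across several cells in a row are the warning sign.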
3. What else might be causing the different results (some real and actual patterns in the data)?
Are the two algorithms tapping different patterns across the variables? Profile the clusters on the 10 variables (look at means and standard deviations) by running a series of ANOVAs using cluster membership and all ten variables, for both the kmeans and the hierarchical solutions (make sure both solutions have the same number of clusters). This will give you a picture of which variables your clusters differ on (do not look for significance; look for general patterns here). It may be that the two algorithms are linking your cases differently, and your profiles will give you some idea of whether this is occurring. Look at the mean differences and the sizes of the standard deviations. A general rule of thumb is that variables with smaller SDs and larger differences between means are better discriminators between clusters (this depends on the algorithm).
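As a rough illustration of the profiling step, the Python sketch below tabulates the mean and SD of each variable within each cluster and then computes a crude separation score (spread of the cluster means divided by the average within-cluster SD). The score is my own shorthand for the rule of thumb above, not an ANOVA F statistic, and the data, labels and variable names are made up.

```python
from statistics import mean, stdev


def profile(rows, labels, names):
    """Return {cluster: {variable: (mean, sd)}} for the given memberships."""
    table = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        table[cluster] = {
            # sd is set to 0.0 for single-case clusters, where it is undefined
            name: (mean(col), stdev(col) if len(col) > 1 else 0.0)
            for name, col in zip(names, zip(*members))
        }
    return table


def separation(table, name):
    """Illustrative score: range of cluster means / average within-cluster sd."""
    means = [stats[name][0] for stats in table.values()]
    sds = [stats[name][1] for stats in table.values()]
    avg_sd = mean(sds)
    return (max(means) - min(means)) / avg_sd if avg_sd else float("inf")


# Made-up toy data: v1 separates the two clusters, v2 does not.
rows = [(1.0, 5.0), (1.2, 9.0), (0.9, 1.0), (4.0, 6.0), (4.1, 2.0), (3.9, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
names = ["v1", "v2"]

table = profile(rows, labels, names)
for name in names:
    print(name, round(separation(table, name), 2))
```

Running the same profile once with the kmeans memberships and once with the hierarchical memberships, then comparing the two tables side by side, shows whether the two solutions are leaning on different variables.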
I have identified some simple, practical things you can do; I hope this helps.
Paul
> Alina Sheyman <asheyman@FAMILYOFFICE.COM> wrote:
>
> Hi all,
>
> I'm trying to figure out what clustering mechanism I should be using for
> my analysis. For now I've tried both Kmeans and hierarchical clustering
> on the same data and have ended up with entirely different clusters. In
> the case of Kmeans I got three clusters that are very close in size,
> whereas with hierarchical clustering almost all the cases ended up in
> one cluster. Is this even possible? (I'm not entirely clear on how
> Ward's algorithm works). Which one should I be using? My database size
> is about 64 cases, and 10 variables were used in clustering.
>
> any advice would be great
>
> Alina Sheyman,
> Family Office Exchange
