Date: Fri, 20 Jul 2007 11:34:58 -0300
Reply-To: Hector Maletta <firstname.lastname@example.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Hector Maletta <email@example.com>
Subject: Re: Two Step Clustering Confusion
Content-Type: text/plain; charset="us-ascii"
Clustering is a heuristic tool, not an analytical or inferential
statistical procedure. There is no "right" clustering solution in a
statistical sense. You may produce (and use or discard) a number of
clustering solutions, according to your research purposes and needs.
At first sight, your results would suggest that the differences (in
other variables) between the categories in your one categorical variable are
so distinctive that they mandate putting each category in a different
cluster. Whether this is actually so could be ascertained by an analysis of
variance, to determine whether the variance of the other variables BETWEEN
categories of the categorical variable is or is not far greater than the
average variance WITHIN categories.
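As a rough illustration of that check, here is a minimal Python sketch
(not SPSS syntax) of a one-way analysis of variance for a single
continuous variable split by the categories. The function name
one_way_anova and the toy data are assumptions made up for this example;
in practice you would run it (or the SPSS ONEWAY procedure) on each of
the 34 continuous variables.

```python
from collections import defaultdict


def one_way_anova(values, groups):
    """Return (F, eta_squared) for one continuous variable split by category.

    values: list of numbers; groups: list of category labels, same length.
    Hypothetical helper for illustration only.
    """
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    n = len(values)
    k = len(by_group)
    grand_mean = sum(values) / n

    # Variation of the category means around the grand mean (BETWEEN).
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    # Variation of the cases around their own category mean (WITHIN).
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2
        for vs in by_group.values()
        for v in vs
    )

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Toy data: three well-separated categories.
values = [1, 2, 3, 11, 12, 13, 21, 22, 23]
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
f_stat, eta_squared = one_way_anova(values, groups)
print(f_stat, eta_squared)
```

A large F, or an eta squared close to 1, would indicate that most of the
variance in that variable lies between categories rather than within
them.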
Another issue is the way two-step clustering proceeds. Its ability
to re-compute cluster centres is relatively limited, as compared for
instance with k-means clustering. Thus it is possible that it starts
assigning the various categories to different clusters, and this initial
allocation is only marginally affected by subsequent calculations based on
the other 34 variables.
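For contrast, the sketch below shows the kind of repeated re-computation
of centres that k-means performs. It is a bare-bones Python version of
the standard k-means iteration, written for illustration only; it is not
the SPSS implementation of either procedure, and the function name and
toy data are assumptions.

```python
def kmeans(points, centres, max_iter=100):
    """Plain k-means: alternate assignment and centre update until stable.

    points: list of equal-length tuples; centres: initial centres.
    Hypothetical helper for illustration only.
    """
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each case goes to its nearest current centre.
        labels = [
            min(
                range(len(centres)),
                key=lambda j: sum((p - q) ** 2 for p, q in zip(pt, centres[j])),
            )
            for pt in points
        ]
        # Update step: each centre becomes the mean of its current members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centres.append(old)  # keep an empty cluster's centre
        if new_centres == centres:
            break  # nothing moved, so the allocation is stable
        centres = new_centres
    return labels, centres


# Deliberately poor starting centres: both sit in the left-hand group.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, centres = kmeans(points, [(0.0,), (1.0,)])
print(labels, centres)
```

Here the poor initial allocation is corrected after a couple of passes
because every case is reassigned each time the centres move.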
One possibility you may attempt is using the other 34 variables (or
perhaps other usable variables in your dataset) to assign numerical values
to each category of the categorical variable, in effect converting the
categorical variable into an interval one, and only then performing the
clustering exercise. The allocation of numerical values to your categorical
variable could be achieved, for instance, with categorical principal
components analysis (CATPCA) using the 34+1 variables (plus perhaps other
relevant variables
that you consider as good predictors of the categorical one). The final
clustering may be done by two step or by k-means.
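To make the idea of quantifying the categories concrete, here is a much
simpler stand-in, sketched in Python. It is not optimal scaling as done
by SPSS; it merely scores each category by the mean of one continuous
summary variable among the cases in that category. The function names,
and the assumption that you have a single summary score per case (for
example the mean of the standardized continuous variables), are
hypothetical.

```python
from collections import defaultdict


def category_scores(categories, values):
    """Map each category to the mean of `values` among its cases.

    Simple mean encoding, used here as a stand-in for optimal scaling.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(categories, values):
        sums[c] += v
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def encode(categories, scores):
    """Replace each case's category label by that category's numeric score."""
    return [scores[c] for c in categories]


# Toy data: two categories and one summary score per case.
categories = ["a", "a", "b", "b"]
summary = [1.0, 3.0, 10.0, 20.0]
scores = category_scores(categories, summary)
print(scores)                      # {'a': 2.0, 'b': 15.0}
print(encode(categories, scores))  # [2.0, 2.0, 15.0, 15.0]
```

The encoded column can then be standardized and clustered together with
the continuous variables.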
Hope this helps.
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Sent: 19 July 2007 18:53
Subject: Two Step Clustering Confusion
I have a clustering solution that seems overly dependent on a single
variable. I am hoping someone on the list can help me understand, or
suggest some analyses that will help me understand, what is going on.
I am clustering 70,000 cases with the Two Step procedure in SPSS,
using the default outlier detection and standardizing all
variables. I am letting the procedure detect the number of clusters
automatically. When I cluster my cases on 35 variables, where 1 is
categorical with 13 classes and the remaining 34 are continuous, I get a
solution with 6 clusters plus an outlying cluster. When I cross-tabulate
the six clusters by the included categorical variable I find a very
strong association between the two variables. For example, over 99% of
three of the segments consists of a single (though each different) value
of the categorical variable.
When I remove the categorical variable I get a two cluster solution
(plus an outlying cluster). To my surprise there was very little [...]
between the two cluster and six cluster assignments of my 70,000 cases.
This feels like my one categorical variable is driving the overall
solution despite the inclusion of 34 other variables. This seems like an
[...] clustering solution to me.
Thank you for your thoughts, Jason
Educator / Analyst