**Date:** Mon, 31 Jul 2006 15:05:37 +0100
**Reply-To:** "Allan Reese (Cefas)" <allan.reese@cefas.co.uk>
**Sender:** "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
**From:** "Allan Reese (Cefas)" <allan.reese@cefas.co.uk>
**Subject:** Re: two-step cluster method
I asked on Friday about documentation for two-step cluster and have since found this on the web:
http://www.norusis.com/pdf/SPC_v13.pdf

There is also a "technical note" on the SPSS website but this is marketing puff and doesn't explain the output. Someone else asked about two-step on spssx-l last December.

Norusis explains that "you are interested in finding the number of clusters at which the ... Information Criterion becomes small and the changes in IC between adjacent number of clusters is small". Her example is, however, similar to mine in that the IC increases monotonically to k=1 [k clusters] and the changes also run monotonically. It is therefore not obvious why "for this example, the algorithm selected three clusters."

There are columns of ratios, and a scree plot might indicate a break point at k=3 for Norusis's example, but the one I posted was not so clear-cut.

HOWEVER, we did think of excluding some outliers, so we set the parameter HANDLENOISE=10.
This has the effect of grouping cases that don't fit any cluster into an outlier group.
The AIC table now makes sense, and we get a small number of clusters at minimum AIC plus the odd cases:

Auto-Clustering

Number of                AIC          Ratio of AIC   Ratio of Distance
Clusters     AIC         Change (a)   Changes (b)    Measures (c)
1            367.706
2            364.890      -2.815         1.000       4.461
3            429.431      64.540       -22.926       1.174
4            496.861      67.430       -23.952       1.610
5            570.569      73.708       -26.182       1.384
6            647.132      76.563       -27.197       1.087
7            724.289      77.157       -27.408       1.243
8            802.784      78.495       -27.883       1.250
9            882.378      79.594       -28.273       .(d)

a The changes are from the previous number of clusters in the table.
b The ratios of changes are relative to the change for the two cluster solution.
c The ratios of distance measures are based on the current number of clusters against the previous number of clusters.
d Since the distance at the current number of clusters is zero, auto-clustering will not continue.
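
For anyone wanting to check columns (a) and (b) by hand, here is a minimal Python sketch that recomputes them from the AIC column, following footnotes a and b. The function names are my own, not anything from SPSS, and the AIC values are the rounded ones printed above, so the results agree with the table only to about two decimal places. Picking the minimum AIC is just the reading used in this message and is not necessarily the rule SPSS applies when auto-clustering.

```python
# AIC values copied from the Auto-Clustering table above (k = 1..9 clusters).
aic = [367.706, 364.890, 429.431, 496.861, 570.569,
       647.132, 724.289, 802.784, 882.378]


def aic_changes(values):
    """Change in AIC from the previous number of clusters (footnote a)."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


def change_ratios(changes):
    """Each change relative to the change for the two-cluster solution (footnote b)."""
    return [c / changes[0] for c in changes]


changes = aic_changes(aic)       # entries correspond to k = 2..9
ratios = change_ratios(changes)  # entries correspond to k = 2..9

# Assumption: "best" here simply means the k with the smallest AIC.
best_k = min(range(len(aic)), key=lambda i: aic[i]) + 1

for k, (c, r) in enumerate(zip(changes, ratios), start=2):
    print(f"k={k}: change={c:8.3f}  ratio={r:8.3f}")
print("minimum AIC at k =", best_k)
```

With these values the minimum falls at k=2, matching the two clusters (plus the outlier group) reported below.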

Cluster Distribution

                             N   % of Combined   % of Total
Cluster   1               1356       46.3%          46.3%
          2               1522       52.0%          52.0%
          Outlier (-1)      49        1.7%           1.7%
          Combined        2927      100.0%         100.0%

How to interpret the output when HANDLENOISE=0 is still a mystery.

Allan
