Date: Tue, 15 Aug 2006 09:04:31 -0400
Reply-To: Art@DrKendall.org
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Art Kendall <Art@DrKendall.org>
Organization: Social Research Consultants
Subject: Re: Cluster Analysis - Seeds needed for K-Means
In-Reply-To: <003201c6bfde$3e0d6cf0$a200a8c0@NOTEBOOK>
Content-Type: text/plain; charset=us-ascii; format=flowed
Hector Maletta's approach sounds very useful.
I'll keep it in mind the next time I use DFA to interpret or refine
a clustering.
-- some elaboration --
In the 70's I started calling sets of cases "core clusters" when several
different agglomeration methods and/or distance measures placed those
cases together.
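
A minimal Python sketch of the "core cluster" idea, for illustration only. It assumes the strictest reading, that a core is a set of cases that every run placed together; the function name, the min_size parameter, and the example labels are all hypothetical and not part of any SPSS procedure.

```python
from collections import defaultdict

def core_clusters(labelings, min_size=2):
    """Group cases that every clustering run placed in the same cluster.

    labelings: list of label lists, one per method/distance measure,
               each holding one cluster label per case.
    Returns a list of case-index lists, largest group first.
    """
    n_cases = len(labelings[0])
    if any(len(labels) != n_cases for labels in labelings):
        raise ValueError("every labeling must cover the same cases")

    groups = defaultdict(list)
    for case in range(n_cases):
        # The tuple of labels across all runs is the case's signature;
        # two cases share a signature only if every run put them together.
        signature = tuple(labels[case] for labels in labelings)
        groups[signature].append(case)

    cores = [members for members in groups.values() if len(members) >= min_size]
    return sorted(cores, key=len, reverse=True)

# Hypothetical labels from three runs on eight cases (assumed data).
ward     = [1, 1, 1, 2, 2, 2, 3, 3]
average  = [1, 1, 2, 2, 2, 2, 3, 3]
complete = [5, 5, 5, 7, 7, 7, 9, 7]
cores = core_clusters([ward, average, complete])
```

Because only co-membership is compared, the cluster numbers do not need to match from one run to the next. Cases left out of every core would be the candidates to treat as unclassified.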
I then used DFA to refine assignments/interpretation. A case was
considered unclassified for the first phase of a DFA if it was far
from the centroid or if it was a "splitter" across the probabilities.
What constitutes a splitter is subjective, and you might want to try
different approaches. Obviously, (.98, .01, .01) is a very definite
assignment, while (.33, .34, .33) is a very ambiguous assignment. You
might want to try different criteria, such as best at least .55 with next
best no more than .4, or best at least .1 better than second best.
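
The two example splitter criteria can be written down as a small Python check, again only as a sketch. The function name and parameter names are hypothetical, the cutoffs are the illustrative values mentioned here rather than recommended settings, and the probabilities are assumed to be the per-case membership probabilities saved from the classification phase.

```python
def is_splitter(probs, min_best=0.55, max_second=0.40, min_margin=None):
    """Return True if one case's membership probabilities are too evenly split.

    probs: membership probabilities for one case, one per cluster.
    Rule A (default): definite if best >= min_best and second <= max_second.
    Rule B (used when min_margin is given): definite if
            best - second >= min_margin.
    """
    ranked = sorted(probs, reverse=True)
    best, second = ranked[0], ranked[1]
    if min_margin is not None:
        return (best - second) < min_margin
    return not (best >= min_best and second <= max_second)

# The two extremes from the text, checked with rule A.
definite = is_splitter((0.98, 0.01, 0.01))    # a clear assignment
ambiguous = is_splitter((0.33, 0.34, 0.33))   # nearly a three-way tie
```

The two rules can disagree: (.45, .30, .25) is a splitter under rule A because the best probability is below .55, but not under rule B with min_margin=.1. That is one way to see why the choice of criterion is a judgment call.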
The DFA was run iteratively until the table of "original" and "assigned"
groups was as stable as it could be.
Another reason to use DFA is that, although the "tests" in the first
phase should not be interpreted in the conventional way, they can be
useful in interpreting what distinguishes the cluster profiles.
Another way to word Hector's point about sampling is that the clusters,
in exploratory terminology, can be very useful as strata, in sampling
terminology.
Art
Social Research Consultants
Hector Maletta wrote:
>I agree with Art Kendall's opinion that "In DFA, I recommend closely examining
>the probabilities of assignment to each cluster for each case, and the
>probability that a member of a cluster would be as far away from the
>centroid as this particular case is. This is a very old but very useful aid
>in interpreting a clustering. The classification phase of DFA should provide
>insight into the reliability of the cluster assignments."
>Besides using or not using DFA for this purpose, cases far away from the
>centroid are often of doubtful usefulness. In one exercise I did with a
>large sample some time ago, I applied clustering to create a certain number
>of clusters, but there were a lot of cases of borderline membership. We
>figured a small amount of measurement error would land those cases in
>another cluster altogether.
>For certain research purposes it proved useful to divide each cluster into a
>"core" and a "periphery", the core being a relatively small area around the
>centroid. This is only useful when many cases are near the centroid, and few
>are in the no-man's land or borderline area between clusters, far away from
>the centroid.
>I do not remember all the details, but I do remember I tried several ways of
>defining the core, including the following: (1) all cases situated within
>the minimum distance from the centroid that encompassed, say, 25% of all
>cases in the cluster; (2) all cases, whatever their number or proportion as
>long as there were at least 30, located within a Euclidean distance of, say,
>one cluster-specific standard deviation from the centroid.
>The "core" of the cluster is usually quite homogeneous, and proved a very
>useful tool to define the "typical" features of the cluster, and to select
>typical cases for frequent follow-up, at least for means if not for
>variability around the mean.
>In fact, what we did was create a "model" (a "model farm-household" in
>that experience) defined by the centroid values of all variables,
>periodically re-evaluating those values by following up a small rotational
>sample of cases randomly selected from the core. Since the centroid was
>supposed to be defined by the mean of those variables for the entire cluster
>(core+periphery), we boldly multiplied the updated centroid means times the
>clusters' total membership to obtain updated population means and totals in
>an economical way (this was done in order to monitor rural development at
>farm/household level in a poor developing country, where large sample
>surveys cannot be carried out with the necessary frequency, and casual
>visits by extension workers are not enough).
>Hope this helps.
>
>Hector Maletta
>
>-----Original message-----
>From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of Art
>Kendall
>Sent: Monday, August 14, 2006 2:13 PM
>To: SPSSX-L@LISTSERV.UGA.EDU
>Subject: Re: Cluster Analysis - Seeds needed for K-Means
>
>It is some time since I used version 12, but the hierarchical
>clustering part has been around since the 70's.
>If you used the SAVE specification, you should have a new variable that
>indicates for each case to which cluster it is assigned. Say you called
>it Kluster3 and the variables to base the clustering on Var01 to Var12.
>
>
>to get the centroids
>(I'm not sure how you would have interpreted the cluster meanings
>without using DISCRIMINANT or means already.)
>discriminant groups= kluster3 (1,3)/ variables = var01 to var12 . . ..
>
>or
>means tables= var01 to var12 by kluster3 /cells= count means . . . .
>
>Once you type the above command into a syntax window, highlight (select)
>the procedure name with your mouse and click the syntax button to see
>other possibilities for the procedure.
>
>In DFA, I recommend closely examining the probabilities of assignment to
>each cluster for each case, and the probability that a member of a
>cluster would be as far away from the centroid as this particular case
>is. This is a very old but very useful aid in interpreting a clustering.
>The classification phase of DFA should provide insight into the
>reliability of the cluster assignments.
>
>The GUI in SPSS is very useful for the first draft of your syntax.
>Simply exit the menus via the "paste" button. This shows you the syntax
>that will do what you specified in the menu. As you look at your
>results, and as you develop your approach you can simply edit the pasted
>syntax.
>
>To get your means into a .sav file (there are more automated ways to
>get the centroids into kmeans, but this is straightforward):
>open a new data file
>label the variables kluster3 and var01 ... var12.
>key in the centroids.
>save the file.
>
>
>
>You might also want to consider applying the TWOSTEP procedure.
>It will produce AIC and BIC to check on the number of clusters to retain.
>
>Art Kendall
>Social Research Consultants
>
>
>Aaron Eakman wrote:
>
>
>
>>I am using SPSS 12 for my clustering procedures. I started with
>>hierarchical clustering using Ward's method with squared Euclidean
>>distance. I have identified a three-cluster solution as the best option
>>from a possible range of 2-4 that I established a priori.
>>
>>Here is my problem: I want to next run a K-means clustering procedure.
>>More specifically, I want to use the centroids of the three clusters from
>>my hierarchical procedure as "seed" or starting values for the K-means
>>clustering procedure. Unfortunately, SPSS does not generate this output
>>from the hierarchical procedure. And I do not know 1) how to generate
>>cluster centroids from the cluster assignment information provided by the
>>SPSS hierarchical procedure, and 2) even if I did, I do not know how
>>to generate an SPSS .sav file with that information for use by the K-means
>>approach. A further problem: I am a point-and-clicker and not savvy with
>>command syntax; I AM WILLING TO LEARN IF IT CAN GET ME OUT OF MY MESS!!
>>
>>Any persons that are SPSS - Cluster Analysis savvy, or know others that
>>might lend a hand would be met with gratitude for any assistance.
>>
>>Take care,
>>
>>Aaron Eakman