| Date: | Fri, 21 Apr 2006 15:32:43 +0400 |
| Reply-To: | Anton.Balabanov@fup.unn.ru |
| Sender: | "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU> |
| From: | Anton Balabanov <Anton.Balabanov@fup.unn.ru> |
| Subject: | Re: Question about initial cluster centers for k-means |
| In-Reply-To: | <741218F5-2AB6-440E-9E3E-CA5674639170@duke.edu> |
| Content-Type: | text/plain; charset="iso-8859-1" |
I agree that the algorithm description does not make clear which cases are
used as the VERY initial seeds. The implication, however,
is 'the first k cases'.
The syntax below is perhaps not a proof, but it illustrates that SPSS
indeed uses the first k cases as
those unexplained (very initial) seeds.
The 4 points form a square on the scatterplot. If we request a two-cluster
solution, then points 1 and 2 have the same 'right' to be initial seeds
as points 3 and 4 (each pair forms a diagonal of the square, so both pairs
are equally the most distant points).
Clearly, if points 1 and 2 are the first cases, they will be chosen as the
initial seeds. If points 3 and 4 come first, they will be chosen instead.
For the 1-3-2-4 order, point 2 will replace point 3, and points 1 and 2
will be the initial seeds again, etc.
Best,
Anton
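* Four points at the corners of a unit square.
DATA LIST LIST /point x y.
BEGIN DATA
1 1 1
2 2 2
3 1 2
4 2 1
END DATA.
* Scatterplot with each point labelled by its number.
GRAPH
  /SCATTERPLOT(BIVAR)=x WITH y BY point (NAME)
  /MISSING=LISTWISE.
* Two-cluster solution with the cases in the order 1-2-3-4.
QUICK CLUSTER
  x y
  /CRITERIA=CLUSTER(2)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL.
* Reverse the case order (4-3-2-1) and repeat.
SORT CASES BY point (D).
QUICK CLUSTER
  x y
  /CRITERIA=CLUSTER(2)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL.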
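For readers without SPSS at hand, here is a minimal Python sketch of the seed-selection pass as it is paraphrased in the quoted message further down: start from the first k cases and let a later case replace one seed of the closest pair when it lies farther from every seed than those two seeds lie from each other. This is not SPSS's actual implementation. The function name pick_seeds is made up for illustration, and the choice of which seed of the closest pair to drop (the one nearer to the new case) is an assumption, since the thread does not say.

```python
from itertools import combinations
from math import dist


def pick_seeds(cases, k):
    """Sketch of the seed-selection pass as paraphrased in this thread.

    Assumes k >= 2 and len(cases) >= k.
    Assumption: a qualifying case replaces whichever seed of the closest
    pair lies nearer to it (the thread does not say which one).
    """
    seeds = list(cases[:k])  # the first k cases are the starting seeds
    for case in cases[k:]:
        # Closest pair among the current seeds and the distance between them.
        i, j = min(combinations(range(k), 2),
                   key=lambda p: dist(seeds[p[0]], seeds[p[1]]))
        closest_pair = dist(seeds[i], seeds[j])
        # Distance from the new case to its nearest seed.
        nearest = min(dist(case, s) for s in seeds)
        if nearest > closest_pair:
            # Drop the member of the closest pair that is nearer to the case.
            drop = i if dist(case, seeds[i]) <= dist(case, seeds[j]) else j
            seeds[drop] = case
    return seeds


# The four corners of the square from the syntax above, keyed by point number.
square = {1: (1, 1), 2: (2, 2), 3: (1, 2), 4: (2, 1)}

ascending = [square[p] for p in (1, 2, 3, 4)]
descending = [square[p] for p in (4, 3, 2, 1)]

print(pick_seeds(ascending, 2))   # points 1 and 2 stay as seeds
print(pick_seeds(descending, 2))  # points 4 and 3 stay as seeds
print(pick_seeds([(0, 0), (1, 0), (10, 0)], 2))  # the far case displaces a seed
```

In both orderings of the square no later case is far enough away to displace a seed, so whichever diagonal comes first in the file is kept. The 1-3-2-4 ordering is deliberately left out: the single strict test sketched here does not trigger a replacement for it, so the real procedure presumably applies a further check that this thread does not spell out.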
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU]On Behalf Of
Kyongje Sung
Sent: Thursday, April 20, 2006 10:14 PM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: Question about initial cluster centers for k-means
Thanks for your information.
I have the PDF file from SPSS that shows the exact algorithm you mentioned.
It looks like what you explained in your message is exactly the same as
the algorithm described in that file. According to this SPSS document, the
quick cluster algorithm consists of three steps, and the very first step is
what you described in your message, except that it does not explain how it
chooses the initial seeds to begin with. My question is all about these
unexplained starting seeds.
It may be that the first step is not actually part of the quick cluster
algorithm.
But whatever the situation is, I'm now clear about the k-means process,
and thank you guys.
K. Sung
The algorithm does not explain how to choose the initial seeds.
On Apr 20, 2006, at 12:48 PM, Anton Balabanov wrote:
> Exactly so. There is syntax at Ray's site that demonstrates how to use
> cluster centers from a hierarchical analysis for k-means analysis:
> http://www.spsstools.net/Syntax/ClusterA/CentersOfHiearchicalCAasInitialValOfK-means.txt
>
> Turning back to the original question: the algorithm used for QUICK
> CLUSTER in SPSS is described in the algorithms documentation (the
> collection of files that can be found on the SPSS tech. support pages).
> There is a special algorithm for choosing well-spaced initial centroids,
> and it is not the same as QUICK CLUSTER. The initial seeds are not
> chosen randomly.
>
> The general idea is to take the first k nonmissing cases as initial
> seeds and then consider case k+1. If its minimum distance to the first k
> seeds is greater than the minimum distance between any pair of the k
> seeds, case k+1 replaces one of the seeds from the closest pair of
> seeds. And so on for each remaining case. After the last case we have k
> well-spaced seeds, and the clustering itself begins.
> That is why the result of k-means clustering depends slightly on the
> order of cases in the data file.
>
> A much more detailed explanation is given in Subhash Sharma, Applied
> Multivariate Techniques, Wiley, 1996. Three different algorithms are
> explained. One of them is said to be used by SAS; I think one of the
> three is used by SPSS as well. If anyone is interested, I could scan the
> corresponding pages and send them to your private e-mail.
>
> Best,
> Anton
>
>
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU]On
> Behalf Of
> Hector Maletta
> Sent: Thursday, April 20, 2006 8:21 PM
> To: SPSSX-L@LISTSERV.UGA.EDU
> Subject: Re: Question about initial cluster centers for k-means
>
>
> Your strategy is sound, Dan, but I don't see the relation to
> choosing SAS
> over SPSS: you can do exactly the same with SPSS. QUICK CLUSTER
> accepts a
> matrix of starting cluster means to begin the analysis, and the
> matrix could
> of course come from a previous hierarchical clustering exercise.
> Hector
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf
> Of Dan Zetu
> Sent: Thursday, April 20, 2006 1:11 PM
> To: SPSSX-L@LISTSERV.UGA.EDU
> Subject: Re: Question about initial cluster centers for k-means
>
> This is one reason I use SAS for clustering. I typically start with
> hierarchical clustering, find the optimal number of clusters I
> need, save
> their centroids and subsequently submit them as "seeds" to a k-means
> technique. As of now, I have not been able to figure out an equivalent
> procedure in SPSS.
>
> Dan
>
>> From: Kyongje Sung <kjsung@duke.edu>
>> Reply-To: Kyongje Sung <kjsung@duke.edu>
>> To: SPSSX-L@LISTSERV.UGA.EDU
>> Subject: Question about initial cluster centers for k-means
>> Date: Thu, 20 Apr 2006 10:23:35 -0400
>>
>> Hi, everyone...
>>
>> I have gone through all the postings about cluster analysis and am
>> still not clear about the way K-means analysis chooses initial cluster
>> centers. When I looked at the manual for K-means analysis about initial
>> centers, it says,
>>
>> "By default, a number of well-spaced cases equal to the number of
>> clusters
>> is selected from the data"
>>
>>
>> I'm not sure about the part "well-spaced cases". Since the K-means
>> procedure uses the QUICK CLUSTER algorithm and syntax, does this
>> "well-spaced cases" mean that it used the quick cluster algorithm to
>> find the initial centers? Since the quick cluster algorithm goes
>> through all data points and makes the initial centers as far apart as
>> possible, I guess this process is what "well-spaced cases" means...
>>
>> If so, Quick Cluster still needs to start with some kind of seed (?)
>> value to find the initial cluster centers. How does SPSS choose these
>> initial values for the quick cluster algorithm so that it can calculate
>> the real (?) initial cluster centers for k-means?
>> Does SPSS choose the seed points for quick cluster randomly from the
>> data?
>>
>>
>> K. Sung
============================================
Kyongje Sung, Ph.D.
--------------------------------------------
Postdoctoral Fellow, Purves Lab
Center for Cognitive Neuroscience
Levine Science Research Center (Rm. B243E)
Box 90999
Duke University
Durham, NC 27708
--------------------------------------------
Email: k.sung@duke.edu
Tel: (919) 684-6276
Fax: (919) 681-0815
www.mind.duke.edu
www.purveslab.net
============================================