Date: Thu, 21 Mar 2002 16:14:27 -0300
Reply-To: Marcos Sanches <marcos_sanches@gallup.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Marcos Sanches <marcos_sanches@gallup.com>
Subject: Re: About measures of distanses in cluster analysis
In-Reply-To: <3C9A1C98.59889957@fibertel.com.ar>
Content-Type: text/plain; charset="iso-8859-1"
Hector,
This subject is very interesting to me, and I would like to ask you a
question. When I have categorical variables I usually run CLUSTER on the
scores from a previous multiple correspondence analysis (HOMALS) performed
on these categorical variables. What do you think about this?
Marcos
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU]On Behalf Of
Hector Maletta
Sent: Thursday, March 21, 2002 2:47 PM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: About measures of distanses in cluster analysis
Eugen,
Hierarchical cluster analysis (procedure CLUSTER in SPSS) is primarily
intended for variables measured at the interval level. Nominal variables
should not be used, though certain tricks allow you to do it anyway.
When you use the CLUSTER procedure in SPSS it lets you choose one
similarity or distance measure among 37 options, some of them adequate
for nominal polytomous or binary variables. (Similarity is complementary
to distance, since it is a measure of the closeness of the cases or
variables you intend to group into clusters).
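
To illustrate how similarity and distance complement each other for nominal
data, here is a minimal Python sketch of a simple matching coefficient
between two cases. The function names and the category codes are made up
for illustration, and this is not claimed to reproduce any particular one
of the SPSS measures.

```python
def simple_matching(case_a, case_b):
    """Share of variables on which two cases fall in the same category."""
    if len(case_a) != len(case_b):
        raise ValueError("cases must have the same number of variables")
    matches = sum(1 for a, b in zip(case_a, case_b) if a == b)
    return matches / len(case_a)


def matching_distance(case_a, case_b):
    """Distance taken as the complement of the similarity."""
    return 1.0 - simple_matching(case_a, case_b)


# Hypothetical category codes for three nominal variables.
case_1 = (1, 3, 2)
case_2 = (1, 5, 2)

print(simple_matching(case_1, case_2))    # the cases agree on 2 of 3 variables
print(matching_distance(case_1, case_2))  # complement of the similarity
```

Two identical cases get similarity 1 and distance 0; two cases that differ
on every variable get similarity 0 and distance 1.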
CLUSTER admits two kinds of data: individual raw data (cases by
variables) or similarity matrices. If you are intent on forming clusters
of cases (grouping similar cases together), the matrix input is a
(usually symmetric) matrix of n cases x n cases, showing the similarity
measure of each pair of subjects; if you are looking for clusters of
variables (grouping similar variables together) then you need a matrix
of k variables x k variables. Your question is evidently about grouping
cases.
Cases are deemed "similar" if they have similar values for all or most
of the variables involved in the analysis. Since your three nominal
variables have 5, 3 and 3 values respectively, your cases are already
grouped into up to 5x3x3=45 homogeneous "clusters", one for each
combination of values. Can you improve on that, grouping these 45 groups
into bigger aggregates? CLUSTER can do it, if you specify a criterion.
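
The 45 starting groups are simply the distinct combinations of values. A
small Python sketch of the count, using made-up category codes and a
made-up handful of cases:

```python
from collections import Counter
from itertools import product

# Category codes for variables with 5, 3 and 3 possible values.
levels = [range(1, 6), range(1, 4), range(1, 4)]

# Every possible combination of values is one elementary group.
cells = list(product(*levels))
print(len(cells))

# Hypothetical cases: each tuple is one case's values on the three variables.
cases = [(1, 1, 1), (1, 1, 1), (5, 3, 2), (2, 3, 3), (5, 3, 2)]

# Cases with identical value combinations land in the same group.
cell_sizes = Counter(cases)
for cell, size in sorted(cell_sizes.items()):
    print(cell, size)
```

With real data some of the 45 combinations may be empty, so the number of
occupied groups can be smaller.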
Chi square is based on the idea of independence between the cases. If
this is true, the position of any case (i.e. its combination of values
of the three variables) is independent of the position of other cases.
Chi square equals zero if this is perfectly true, and grows above zero
as cases are more and more associated with each other. Phi square is a
normalized version of chi square (chi square divided by N).
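
As a rough illustration of the arithmetic, the following Python sketch
computes the textbook Pearson chi square for a table of frequencies and
divides it by N to get phi square. The function names and the two tables
are made up, and the measures that CLUSTER actually applies to pairs of
cases may be defined differently, so check the Syntax Reference before
relying on this.

```python
def chi_square(table):
    """Pearson chi square for a table of counts given as a list of rows.

    Assumes every row total and column total is greater than zero.
    Returns the statistic together with the total count N.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def phi_square(table):
    """Chi square divided by the total count N."""
    stat, n = chi_square(table)
    return stat / n


# Hypothetical tables: rows and columns are independent in the first one
# and associated in the second one.
independent = [[10, 20], [20, 40]]
associated = [[10, 20], [20, 10]]

print(chi_square(independent)[0], phi_square(independent))
print(chi_square(associated)[0], phi_square(associated))
```

The independent table gives a chi square of zero, while the associated one
gives a positive value that phi square rescales by the number of cases.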
Another approach for clustering cases based on nominal variables is
Answer Tree, a separate program also distributed by SPSS. It groups
different combinations of values together based on an external criterion
variable. For instance, suppose your nominal variables are
neighbourhood, profession, and nationality, and you wish to form
homogeneous groups in terms of income. With 3 neighbourhoods, 5
professions and 3 nationalities you'd have up to 45 homogeneous groups.
Answer Tree will put together those elementary groups that do not
significantly differ in income. Of course, using another criterion (such
as years of education, age or whatever) would result in a different
grouping.
For details about the CLUSTER command see the Syntax Reference Manual,
which is included on your installation CD and is probably present on your
hard disk as a file named spssbase.pdf (readable with Adobe Acrobat
Reader, which you can download easily from many websites including that
of its maker, www.adobe.com).
Hope this helps.
Hector Maletta
Universidad del Salvador
Buenos Aires, Argentina
Евгений Большов wrote:
>
> Hello dear SPSS list-members!
>
> I've encountered problem with choosing the most efficient and
> theoretically proper measure of distances between objects in cluster
> analysis.
> I have three variable that are measured in nominal scale: first variable
> has three possible value, second one has five possible values and third
> one has also three possible values.
> What I need to do is cluster analysis on the ground of this three
> variables.
> What kind of measure of distances should I use to do this?
> And who can give me an explanations how the Chi-Square and Phi-Square
> works?
> Should I use the Chi-Square measure of distances or it would be better to
> transform my data into binary variables and use measure of distances for
> this kind of scale?
> Any information on this topic would be extremely useful.
>
> Thank you in advance.
> Eugen Bolshov