LISTSERV at the University of Georgia
Date:         Thu, 21 Mar 2002 16:14:27 -0300
Reply-To:     Marcos Sanches <marcos_sanches@gallup.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Marcos Sanches <marcos_sanches@gallup.com>
Subject:      Re: About measures of distanses in cluster analysis
Comments: To: hmaletta@fibertel.com.ar
In-Reply-To:  <3C9A1C98.59889957@fibertel.com.ar>
Content-Type: text/plain; charset="iso-8859-1"

Hector,

This subject is very interesting to me, and I would like to ask you a question. When I have categorical variables, I usually run CLUSTER on the scores from a previous multiple correspondence analysis (HOMALS) performed on those categorical variables. What do you think about this?

Marcos

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Hector Maletta
Sent: Thursday, March 21, 2002 2:47 PM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: About measures of distanses in cluster analysis

Eugen,

Hierarchical cluster analysis (procedure CLUSTER in SPSS) is primarily intended for variables measured at the interval level. Nominal variables should not be used, though certain tricks allow you to do it anyway. When you use the CLUSTER procedure in SPSS, it lets you choose one similarity or distance measure among 37 options, some of them suitable for nominal (polytomous) or binary variables. (Similarity is complementary to distance, since it measures the closeness of the cases or variables you intend to group into clusters.)

CLUSTER admits two kinds of data: individual raw data (cases by variables) or similarity matrices. If you intend to form clusters of cases (grouping similar cases together), the matrix input is a (usually symmetric) matrix of n cases x n cases, showing the similarity measure for each pair of cases; if you are looking for clusters of variables (grouping similar variables together), then you need a matrix of k variables x k variables.

Your question is evidently about grouping cases. Cases are deemed "similar" if they have similar values for all or most of the variables involved in the analysis. Since your three nominal variables have 5, 3 and 3 values respectively, you already have your cases grouped into 5x3x3=45 homogeneous "clusters". Can you improve on that, grouping these 45 groups into bigger aggregates? CLUSTER can do it, if you specify a criterion.

Chi-square is based on the idea of independence between the cases. If this holds, the position of any case (i.e. its combination of values on the three variables) is independent of the position of other cases. Chi-square equals zero if this is perfectly true, and grows above zero as cases become more and more correlated with each other. Phi-square is a normalized version of chi-square (chi-square divided by N).

Another approach for clustering cases based on nominal variables is Answer Tree, a separate software package also distributed by SPSS. It groups different combinations of values together based on an external criterion variable. For instance, suppose your nominal variables are neighbourhood, profession, and nationality, and you wish to form homogeneous groups in terms of income. With 3 neighbourhoods, 5 professions and 3 nationalities you'd have up to 45 homogeneous groups. Answer Tree will put together those elementary groups that do not significantly differ in income. Of course, using another criterion (such as years of education, age or whatever) would result in a different grouping.
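
To make the relation between chi-square and phi-square concrete, here is a minimal Python sketch of the textbook Pearson statistic for a two-way table of counts. It is only an illustration: the function names and the two small tables are made up for this example, and the CHISQ and PH2 measures inside CLUSTER are defined in the Syntax Reference and may be computed differently.

```python
def chi_square(table):
    """Pearson chi-square for a two-way table given as a list of rows of counts.

    Assumes every row total and column total is positive (hypothetical helper,
    not an SPSS function).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi_square(table):
    """Chi-square divided by the total number of cases N."""
    chi2, n = chi_square(table)
    return chi2 / n


# Made-up tables: the first has proportional rows (independence),
# the second has all cases on the diagonal (perfect association).
independent = [[10, 20], [20, 40]]
associated = [[30, 0], [0, 30]]

print(chi_square(independent)[0])  # 0.0
print(chi_square(associated)[0])   # 60.0
print(phi_square(associated))      # 1.0
```

Dividing by N removes the dependence on sample size: doubling every count in a table doubles chi-square but leaves phi-square unchanged.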

For details about the CLUSTER command see the Syntax Reference Manual, which is included on your installation CD and is probably present on your hard disk as a file named spssbase.pdf (readable with Adobe Acrobat Reader, which you can download easily from many websites including its maker's, www.adobe.com). Hope this helps.

Hector Maletta
Universidad del Salvador
Buenos Aires, Argentina

Евгений Большов wrote:
>
> Hello dear SPSS list-members!
>
> I've encountered problem with choosing the most efficient and theoretically proper measure of distances between objects in cluster analysis.
> I have three variable that are measured in nominal scale: first variable has three possible value, second one has five possible values and third one has also three possible values.
> What I need to do is cluster analysis on the ground of this three variables.
> What kind of measure of distances should I use to do this?
> And who can give me an explanations how the Chi-Square and Phi-Square works?
> Should I use the Chi-Square measure of distances or it would be better to transform my data into binary variables and use measure of distances for this kind of scale?
> Any information on this topic would be extremely useful.
>
> Thank you in advance.
> Eugen Bolshov
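
The option raised in the quoted question, recoding the nominal variables into binary indicators before measuring distances, can be sketched in Python as follows. This is an illustration only: the category labels, the three cases, and the helper names are invented, and it does not reproduce any particular SPSS measure. It shows that, with one 0/1 indicator per category, the squared Euclidean distance between two cases is simply twice the number of variables on which they differ.

```python
def dummy_code(case, categories):
    """Turn one case (a tuple of nominal values) into a flat list of 0/1 indicators."""
    indicators = []
    for value, levels in zip(case, categories):
        indicators.extend(1 if value == level else 0 for level in levels)
    return indicators


def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length indicator vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mismatches(case_a, case_b):
    """Number of nominal variables on which the two cases differ."""
    return sum(1 for x, y in zip(case_a, case_b) if x != y)


# Hypothetical category labels for variables with 3, 5 and 3 values.
categories = [
    ["a", "b", "c"],
    ["p", "q", "r", "s", "t"],
    ["x", "y", "z"],
]

case_1 = ("a", "p", "x")
case_2 = ("a", "q", "z")  # differs from case_1 on two variables
case_3 = ("a", "p", "x")  # identical to case_1

d12 = squared_euclidean(dummy_code(case_1, categories), dummy_code(case_2, categories))
d13 = squared_euclidean(dummy_code(case_1, categories), dummy_code(case_3, categories))

print(len(dummy_code(case_1, categories)))  # 11 indicators = 3 + 5 + 3
print(d12, 2 * mismatches(case_1, case_2))  # 4 4
print(d13)                                  # 0
```

Because each mismatch switches off one indicator and switches on another, every variable contributes either 0 or 2 to the distance, so with this coding the distance carries no more information than a count of disagreements.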

