Date: Thu, 7 Jul 2005 16:27:15 +0100
Reply-To: Ian Wakeling <ian.wakeling@HANANI.QISTATS.CO.UK>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Wakeling <ian.wakeling@HANANI.QISTATS.CO.UK>
Subject: Re: Cluster analysis for binary data
> >>> Susie Li <Susie.Li@TVGUIDE.COM> 7/7/2005 8:21:29 AM >>>
> With nominal and binary data, you are better off using
> regression instead of clustering, because you are violating too many
> clustering assumptions.
> With nominal data, you need to do some data transformation (changing
> them to binary) before logistic regressions.
"Peter Flom" <flom@NDRI.ORG> replied:
> Logistic regression is not really a substitute for cluster analysis, as
> far as I can see. In logistic regression (whether binary or multinomial
> logistic) you need to know the categories BEFORE you start the analysis.
> With cluster analysis, you are attempting to determine the number of
> categories and which subjects go into which cluster.
> Can cluster analysis be done with binary data?
> Well, I am no expert; the OP might want to search the archives of
> SAS-L, I think this has been discussed before. Also, the OP might want
> to write to CLASS-L, which is all about classification and clustering.
> It's not very busy, but it's there.
> The problem I see is not with the clustering method, but with the
> determination of distance. But that's just a gut feeling, not backed by
> literature or research.
> I'd be interested in hearing what the statistics experts on this list
> think about this.
I don't claim to be an expert, but I think Peter is right. In the SAS sample
library on my machine there is the file
C:\Program Files\SAS\SAS 9.1\stat\sample\distanx2.sas
that contains an interesting example of clustering with binary data on
conditions for divorce in US states. It uses the new PROC DISTANCE
to compute a Jaccard dissimilarity coefficient. With this sort of data
it's important to decide whether zero-zero matches (two observations that
both lack an attribute) should count as agreement, because that decision
determines which distance measure is appropriate.
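
To make the zero-zero point concrete, here is a minimal Python sketch. It is
not the code from the SAS sample file, and the function names are my own,
chosen for illustration. It compares the Jaccard dissimilarity, which ignores
zero-zero matches, with the simple matching dissimilarity, which counts them
as agreement.

```python
def match_counts(x, y):
    """Return (a, b, c, d): counts of 1-1, 1-0, 0-1 and 0-0 pairs."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if len(x) == 0:
        raise ValueError("vectors must not be empty")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c): zero-zero matches are left out entirely."""
    a, b, c, _ = match_counts(x, y)
    if a + b + c == 0:
        # Assumed convention for this sketch: two all-zero vectors are identical.
        return 0.0
    return (b + c) / (a + b + c)


def simple_matching_dissimilarity(x, y):
    """(b + c) / (a + b + c + d): zero-zero matches count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (b + c) / (a + b + c + d)


# Two observations that share no attribute but are both absent on six items.
u = [1, 0, 0, 0, 0, 0, 0, 0]
v = [0, 1, 0, 0, 0, 0, 0, 0]

print(jaccard_dissimilarity(u, v))          # 1.0  (nothing in common)
print(simple_matching_dissimilarity(u, v))  # 0.25 (six shared absences count)
```

The same pair of observations looks maximally different under one measure and
fairly similar under the other, so a clustering run on the resulting distance
matrix can group them quite differently depending on which is used.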