LISTSERV at the University of Georgia
Date:   Wed, 1 Sep 2010 13:26:24 -0700
Reply-To:   Alex Tang <Alex.Tang@CREDITONE.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Alex Tang <Alex.Tang@CREDITONE.COM>
Subject:   Re: interpreting clusters
Comments:   To: oloolo <dynamicpanel@YAHOO.COM>
In-Reply-To:   <201009011841.o81HlZrW018704@malibu.cc.uga.edu>
Content-Type:   text/plain; charset="us-ascii"

Yes. Since I started with many variables, I think the curse of dimensionality is a concern, and it is also very likely that many of the variables are multicollinear. The SAS documentation also recommends a PCA or factor analysis. But what would make me choose PCA over factor analysis here, or the other way around?

Since you brought up that the two kinds of clustering are typically applied to different dimensions, why do you think hierarchical clustering is more suitable for variable clustering? I used to use PROC VARCLUS as a step to group highly correlated variables in order to eliminate multicollinearity when building models, but I never knew PROC CLUSTER could also do this. Everything I read about it in the SAS documentation is what you call subject clustering, the same as PROC FASTCLUS. The difference is just that PROC FASTCLUS is k-means clustering and PROC CLUSTER is hierarchical clustering. From some points of view, PROC CLUSTER is a bit more flexible than PROC FASTCLUS.

Back to my question about interpreting clusters. I haven't done so-called subject clustering before, so I have to ask you experts out here for opinions and experience. I totally agree with you that trying to define a cluster with a single attribute variable is inefficient, so I would expect a multi-variable grouping, if there is one and I can find it. Kelly has given some very good advice on grouping ranges of continuous variables and comparing the means and ratios across clusters. But still, as you also mentioned, doing it this way means I would probably look at one variable at a time, which may make me miss some information or misrepresent it. I agree it would be best to look at multiple variables at the same time if that is feasible, but how? Any input on this?
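One possible way to look at several variables at once is to put each cluster's means on a common scale. The sketch below is written in Python rather than SAS purely for illustration; the function name `cluster_profile` and the toy customers are hypothetical and not from this thread. For each cluster and variable it reports how many overall standard deviations the cluster mean sits above or below the overall mean, so the variables that stand out together for a cluster can be read off one row.

```python
from statistics import mean, pstdev

def cluster_profile(rows, cluster_key, variables):
    """Return {cluster: {variable: z}}, where z is the distance of the
    cluster mean from the overall mean in overall standard deviations."""
    # Overall mean and (population) standard deviation per variable.
    overall = {}
    for v in variables:
        values = [r[v] for r in rows]
        overall[v] = (mean(values), pstdev(values))

    # Group the rows by their cluster label.
    by_cluster = {}
    for r in rows:
        by_cluster.setdefault(r[cluster_key], []).append(r)

    # Standardized cluster means; a constant variable gets 0.0.
    profile = {}
    for c, members in by_cluster.items():
        profile[c] = {}
        for v in variables:
            mu, sd = overall[v]
            cluster_mean = mean(m[v] for m in members)
            profile[c][v] = (cluster_mean - mu) / sd if sd > 0 else 0.0
    return profile

# Hypothetical toy data: age in years, income in thousands.
rows = [
    {"cluster": "A", "age": 55, "income": 40},
    {"cluster": "A", "age": 57, "income": 45},
    {"cluster": "B", "age": 37, "income": 80},
    {"cluster": "B", "age": 39, "income": 85},
]
profile = cluster_profile(rows, "cluster", ["age", "income"])
for c in sorted(profile):
    print(c, {v: round(z, 2) for v, z in profile[c].items()})
```

In this toy case cluster A comes out as older and lower income at the same time, which is the kind of joint description a single-variable cutoff would miss. Which threshold on these standardized values counts as "distinctive" is a judgment call.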

-----Original Message-----
From: oloolo [mailto:dynamicpanel@YAHOO.COM]
Sent: Wednesday, September 01, 2010 11:41 AM
To: SAS-L@LISTSERV.UGA.EDU; Alex Tang
Subject: Re: interpreting clusters

First of all, hierarchical clustering and disjoint partition clustering are typically used on different dimensions of a typical customer profile data set.

In my experience, hierarchical clustering is mostly used on the attribute (variable) dimension, while disjoint partition clustering, like k-means, is mostly used on the subject dimension. This is partly due to the different computational requirements and partly due to the ability to interpret the results. Of course, there are cases where people conduct hierarchical clustering on both dimensions (heat maps in bioinformatics) and disjoint clustering on the variable dimension (like VARCLUS).

For segmentation, the most widely used algorithm is still k-means clustering, I believe. Alex, you use information from a set of variables to generate your clusters, based on the 'closeness' of a point to the currently formed cluster centroids; this is the idea behind the k-means algorithm that FASTCLUS implements. Thus, it is not very appropriate to look at only one variable, because what separates this group from the rest is not only, using your example, that these subjects are relatively older, but maybe also that they are more likely to live in urban areas, more likely to use public transportation, etc. You do need to look at multiple attributes simultaneously.

That being said, it also depends on how well separated these clusters are. If a few variables each provide powerful separation of the data, focusing on only one of them may be doable.

When you want to apply your clustering result to a new set of subjects, you should take the mean profile by cluster as seeds, and conduct a 1-nearest-neighbor search against this seed table to assign your new subjects to the already-found clusters. This can also provide a soft clustering, by looking at the relative distance of each new subject to all of the generated cluster centroids. In this case, you use PROC SCORE. Also, since it relates to how well the set of subjects can be separated, you may want to check the standard deviation in addition to the mean.
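The nearest-seed assignment described above can be sketched as follows. This is an illustration in Python, not the SAS procedure itself; the function names, the centroid values, and the inverse-distance weighting used for the soft membership are all assumptions made for the example. It also assumes the new subject's variables are on the same scale as the ones used when the clusters were formed.

```python
import math

def assign_to_centroids(point, centroids):
    """centroids: {label: list of cluster means}.
    Returns (nearest label, {label: Euclidean distance})."""
    distances = {label: math.dist(point, c) for label, c in centroids.items()}
    nearest = min(distances, key=distances.get)
    return nearest, distances

def soft_membership(distances):
    """Turn distances into weights that sum to 1 using inverse distance.
    This weighting is only one of many possible choices."""
    eps = 1e-12  # guards against division by zero when a point sits on a centroid
    inverse = {label: 1.0 / (d + eps) for label, d in distances.items()}
    total = sum(inverse.values())
    return {label: w / total for label, w in inverse.items()}

# Hypothetical seed table: mean (age, income) per cluster.
centroids = {"A": [55.0, 40.0], "B": [38.0, 80.0]}
new_subject = [50.0, 45.0]

nearest, distances = assign_to_centroids(new_subject, centroids)
weights = soft_membership(distances)
print(nearest, distances, weights)
```

The hard assignment is just the label with the smallest distance; the weights give a rough sense of how ambiguous that assignment is for a subject that lies between clusters.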

Last but not least, you need to be careful when you have a lot of variables to use in the clustering algorithm. Too many variables in a k-means clustering will bring on the curse of dimensionality, and the result is sparseness in the neighborhood of any given subject. This makes your result unstable. When you have so many variables, it is quite unlikely that they are mutually independent, so it is a good idea to use dimension reduction techniques first, say, generate a biplot using PCA, etc. These will themselves serve as a first-order indication of whether you can get a good clustering result.
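The sparseness effect can be illustrated with a small simulation. This is only a sketch on uniform random data with made-up sizes, in Python rather than SAS; `distance_contrast` is a hypothetical helper, not something from this thread. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance. When that ratio gets small, "nearest centroid" carries little information.

```python
import math
import random

def distance_contrast(n_points, n_dims, seed=0):
    """(max distance - min distance) / min distance from one random
    query point to n_points uniform random points in n_dims dimensions."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100):
    print(d, round(distance_contrast(500, d), 2))
```

On data like this the contrast should drop sharply as the number of dimensions grows. Real customer variables are correlated rather than independent, which is exactly why a reduction step such as PCA can help before clustering.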

On Wed, 1 Sep 2010 10:57:57 -0700, Alex Tang <Alex.Tang@CREDITONE.COM> wrote:

>Kelly, thanks for your input. I am also thinking of this kind of manual
>check and comparison.
>
>This way, we will have to pull the mean (and maybe also the range?) of
>all the possible/available variable for each cluster and compare across
>all the clusters, right?
>
>Now the challenge is to decide the combination of variable (and cutoff
>point) to use for defining the clusters. Say, when I see cluster A have
>a average age of 55, and all other clusters average 38, I would think
>age could be one of variables to define cluster A. But what age should I
>use as the cutoff value for cluster A? I think it could be anywhere
>between 38 to 55, isn't it? Besides age, I will need to continue other
>variable to see a possible cut to distinguish cluster A from others,
>right? When it comes to multiple variables, is there anything I should
>watch out for?
>
>-----Original Message-----
>From: Thevenet-Morrison, Kelly
>[mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
>Sent: Wednesday, September 01, 2010 10:47 AM
>To: Alex Tang
>Subject: RE: interpreting clusters
>Sensitivity: Confidential
>
>If there is an output statement in fastclus to append the clusters to
>your data you could create quick profiles to test differences for your
>continuous variables using proc means with a class statement - your
>segment or cluster number would be your class variable. See how
>different they are. In the past I started with that and combined
>clusters if they were not that different from one another.
>
>Kelly
>
>Kelly Thevenet-Morrison MS
>Lead Programmer Analyst
>Department of Community and Preventive Medicine
>University of Rochester School of Medicine and Dentistry
>601 Elmwood Ave., box 644
>Rochester, NY 14642
>Phone: 585-275-1817
>e-mail: kelly_thevenet-morrison@urmc.rochester.edu
>
>-----Original Message-----
>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
>Alex Tang
>Sent: Wednesday, September 01, 2010 1:37 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: interpreting clusters
>Sensitivity: Confidential
>
>We are segmenting our customers based on some profiles. Since we don't
>have an explicit target right now, we are decided to do cluster analysis
>at the time. There are about 2MM customers with 50-100 variables.
>
>I understand for such a big data set, I should have do PROC FASTCLUS to
>get a relative big number of preliminary cluster set first, say, 100,
>then import them to PROC CLUSTER for further analysis. If necessary, a
>factor analysis or principal component analysis is deemed in the front
>as well.
>
>When I am confused here is, suppose I get a final set of clusters here,
>how do I interpret the clusters? It would be nice if I can describe the
>clusters based on the profiles of the customers. E.g. cluster A is the
>customers older than 40 years old and having an annual income less than
>50k... something like that
>
>Or for the interpretation purpose, I should refer to approach other than
>cluster analysis? Either, please advise. Thank you.
>
>******************* E-mail non-disclosure ******************
>
>The information contained in this e-mail message may be proprietary
>and/or confidential, and
>protected from disclosure. If the reader of this message is not the
>intended recipient,
>or an employee or agent responsible for delivering this message to the
>intended recipient,
>you are hereby notified that any dissemination, distribution or copying
>of this communication
>is strictly prohibited. If you have received this communication in
>error, please notify
>Credit One Bank immediately by replying to this message and delete the
>original message. Thank you.


