LISTSERV at the University of Georgia
Date:         Wed, 21 Sep 2005 17:04:01 -0400
Reply-To:     Talbot Michael Katz <topkatz@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Talbot Michael Katz <topkatz@MSN.COM>
Subject:      Re: Cluster node for classification purpose
Comments: To: Patrick Tran <ctll04@YAHOO.COM>

Hi, Patrick.

Clustering is not typically used to classify a target variable directly.

Clustering is sometimes used to define a target variable; your target value is the id of the cluster you belong to. Then you can build a model to predict cluster membership.

Sometimes clustering is used in conjunction with classification of a target variable; highly separated clusters may indicate that separate classification models should be built within each cluster.

And then there is "targeted clustering." Usually this involves a continuous dependent variable that has roughly qualitative breaks at different levels, and a complex, nonlinear relationship with the set of predictor variables; including the dependent variable as one of the clustering variables may help identify the qualitative break levels of the dependent variable (which could be used to create a new, discrete dependent variable). Or the clusters may indicate subpopulations for which the dependent variable can be modeled linearly. However, there is no guarantee that the targeted clusters will achieve either aim; it's a trial and error sort of process.

So I don't think that your question 1 is well-defined. The clusters that you build may not classify your target variable well, even if you include the target variable as one of the cluster variables. There are several techniques for classifying targets with more than two values, and some of them are available in Enterprise Miner. If you ask real nicely, maybe Peter Flom will send you his paper on multinomial logit modeling in SAS (using SAS/STAT, not E-Miner). And don't forget good old-fashioned discriminant analysis (PROC DISCRIM), which seems to have fallen out of favor, but is intuitive and easy to carry out.
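
One rough way to see how well a set of clusters lines up with the five target classes is to cross-tabulate cluster id against class, then look at the majority class and its share ("purity") within each cluster. The sketch below is only an illustration in plain Python, not Enterprise Miner output: the data are made up, and the function names `cluster_class_table` and `majority_class` are hypothetical. Low purity in a cluster is a sign that the cluster does not classify the target well.

```python
from collections import Counter, defaultdict

def cluster_class_table(cluster_ids, classes):
    """Cross-tabulate cluster id against target class (counts per cell)."""
    table = defaultdict(Counter)
    for cid, cls in zip(cluster_ids, classes):
        table[cid][cls] += 1
    return table

def majority_class(table):
    """For each cluster, return (majority class, share of that class)."""
    summary = {}
    for cid, counts in table.items():
        cls, n = counts.most_common(1)[0]
        summary[cid] = (cls, n / sum(counts.values()))
    return summary

# Made-up example: 12 observations, 3 clusters, target classes A..E.
cluster_ids = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
classes = ["A", "A", "A", "B", "C", "C", "D", "E", "E", "B", "E", "E"]

table = cluster_class_table(cluster_ids, classes)
summary = majority_class(table)
for cid in sorted(summary):
    cls, purity = summary[cid]
    print(f"cluster {cid}: majority class {cls}, purity {purity:.2f}")
```

With more clusters than classes, several clusters can share the same majority class, and some classes may never be the majority anywhere, which is one reason the labeling is not well-defined in general.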

Question 2 is a very interesting one. This is a subtle point that applies generally to modeling techniques. People like to say, "If the distribution of your scoring population is different from the distribution of your modeling population, then the model does not apply." But that's not exactly true. Think about it this way (and, please, somebody correct me if I've got this wrong). Suppose I have two identically distributed populations, A and B. I build a model for population A, I score population B, no problem. But suppose, instead of scoring all of B, I cut away the tails. Then I may actually have decreased the standard errors on my scored population! Even if I look at a skewed subset of the B population, the standard errors of the group means may inflate, but the standard errors of the individual predictions will remain the same. Where you run into trouble is if the B population distribution falls significantly outside the range of the A population distribution (sometimes you hear this referred to as "out-of-sample").
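
The point about individual predictions can be illustrated with the textbook formula for the standard error of a fitted mean in simple linear regression, s * sqrt(1/n + (x0 - xbar)^2 / Sxx). This is a sketch under the assumption that the model is an ordinary least-squares line, which the discussion above does not specify; the data and the function names `fit_simple_ols` and `se_fitted` are made up for illustration. The standard error at a point x0 depends only on the modeling data and on x0 itself, so it is smallest near the middle of the modeling population and grows as x0 moves out past the modeling range.

```python
import math

def fit_simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares; keep what the SE formula needs."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    return {"n": n, "xbar": xbar, "sxx": sxx, "b0": b0, "b1": b1, "s": s}

def se_fitted(model, x0):
    """Standard error of the fitted mean at x0."""
    return model["s"] * math.sqrt(
        1 / model["n"] + (x0 - model["xbar"]) ** 2 / model["sxx"]
    )

# Made-up modeling population "A": x runs from 0 to 10.
x = [float(i) for i in range(11)]
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5, 0.0]
y = [2.0 + 3.0 * xi + e for xi, e in zip(x, noise)]

model = fit_simple_ols(x, y)
for x0 in (5.0, 10.0, 20.0):  # center, edge, and well outside the range of A
    print(f"x0 = {x0:5.1f}  SE of fitted mean = {se_fitted(model, x0):.4f}")
```

Scoring only the records near x = 5 gives smaller standard errors than scoring the full range, while a record at x = 20, far outside the modeling data, gets a much larger one.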

-- TMK -- "The Macro Klutz"

On Wed, 21 Sep 2005 06:40:39 -0700, pa pa <ctll04@YAHOO.COM> wrote:

>Hi there,
>In the SAS Enterprise Miner, there is clustering node. I would like to use it for classification. My dataset has a target variable with 5 classes/values (A,B,C,D and E). I fed the dataset into the Cluster node.
>
>However, it only group the inputs into groups. But this grouping is not same as the target attribute which consists of 5 classes (I know clustering is un-supervised learning, and it doesnt need to know the labels of the training set ). Particularly, theere are more than 5 clusters generated.
>
>Q1: How can I know which class (A,B,C,D,E) a cluster belongs to?
>
>Q2: Is that true the clustering techniques assume the distribution of A,B,C,D and E during training. So if the test set (after finish training) does not follow this assumed distribution, the clustering technique will not work properly?
>
>Thanks
>Have a nice day
>Patrick Tran

