Date: Wed, 21 Sep 2005 17:04:01 -0400
Reply-To: Talbot Michael Katz <topkatz@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Talbot Michael Katz <topkatz@MSN.COM>
Subject: Re: Cluster node for classification purpose
Clustering is not typically used to classify a target variable directly.
Clustering is sometimes used to define a target variable; your target
value is the id of the cluster you belong to. Then you can build a model
to predict cluster membership.
Sometimes clustering is used in conjunction with classification of a
target variable; highly separated clusters may indicate that separate
classification models should be built within each cluster.
And then there is "targeted clustering." Usually this involves a
continuous dependent variable that has roughly qualitative breaks at
different levels, and a complex, nonlinear relationship with the set of
predictor variables; including the dependent variable as one of the
clustering variables may help identify the qualitative break levels of the
dependent variable (which could be used to create a new, discrete
dependent variable). Or the clusters may indicate subpopulations for
which the dependent variable can be modeled linearly. However, there is
no guarantee that the targeted clusters will achieve either aim; it's a
trial and error sort of process.
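To make the targeted-clustering idea concrete, here is a rough sketch in Python rather than SAS. Everything in it is an assumption for illustration: the toy data, the helper name kmeans, and the choice to seed the two centroids with the lowest-y and highest-y points. The data are constructed so that the dependent variable y jumps between two levels in a way that is not monotone in x; clustering on (x, y) together then separates the two levels, and the gap between the clusters suggests a break level. Real data would normally need the variables standardized first, and, as noted above, nothing guarantees so clean a split.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on tuples, starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so we have converged
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Made-up data: (predictor x, dependent variable y). y is high only for mid-range x.
data = [(0, 1.0), (1, 1.2), (2, 0.9), (3, 10.1), (4, 9.8),
        (5, 10.3), (6, 10.0), (7, 1.1), (8, 0.8), (9, 1.0)]

# Hypothetical seeding: start from the lowest-y and highest-y observations.
low_seed = min(data, key=lambda p: p[1])
high_seed = max(data, key=lambda p: p[1])
assignment, centroids = kmeans(data, [low_seed, high_seed])

low_ys = [y for (x, y), a in zip(data, assignment) if a == 0]
high_ys = [y for (x, y), a in zip(data, assignment) if a == 1]

# Candidate break level: midpoint of the gap between the two clusters' y ranges.
break_level = (max(low_ys) + min(high_ys)) / 2
print("cluster assignment:", assignment)
print("candidate break level for y:", break_level)
```

The break level could then be used to define a new, discrete dependent variable (y above or below the break), which is the first of the two aims described above.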
So I don't think that your question 1 is well-defined. The clusters that
you build may not classify your target variable well, even if you include
the target variable as one of the cluster variables. There are several
techniques for classifying targets with more than two values, and some of
them are available in Enterprise Miner. If you ask real nicely, maybe
Peter Flom will send you his paper on multinomial logit modeling in SAS
(using SAS/STAT, not E-Miner). And don't forget good old-fashioned
discriminant analysis (PROC DISCRIM), which seems to have fallen out of
favor, but is intuitive and easy to carry out.
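For what it's worth, one rough way to see which class a cluster "belongs to" is simply to cross-tabulate cluster id against the known target and look at the majority class and its share in each cluster. The sketch below does that in Python (in SAS, a PROC FREQ cross-tab of cluster by target would give the same table). The data and the function names cluster_class_table and majority_class are made up for illustration. A low share is exactly the situation described above, where the clusters do not classify the target well.

```python
from collections import Counter, defaultdict

def cluster_class_table(cluster_ids, labels):
    """Cross-tabulate cluster id against the known target class."""
    table = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        table[cid][label] += 1
    return table

def majority_class(table):
    """Map each cluster to (most common class, share of the cluster in that class)."""
    result = {}
    for cid, counts in table.items():
        label, n = counts.most_common(1)[0]
        result[cid] = (label, n / sum(counts.values()))
    return result

# Made-up example: cluster ids from some clustering run, plus the true classes.
cluster_ids = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
labels = ["A", "A", "A", "B", "C", "C", "D", "E", "E", "D"]

mapping = majority_class(cluster_class_table(cluster_ids, labels))
for cid in sorted(mapping):
    label, share = mapping[cid]
    print(f"cluster {cid}: majority class {label}, share {share:.2f}")
```

Note that nothing forces the mapping to be one-to-one: two clusters can share a majority class, and some classes may not be the majority anywhere.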
Question 2 is a very interesting one. This is a subtle point that applies
generally to modeling techniques. People like to say, "If the
distribution of your scoring population is different than the distribution
of your modeling population, then the model does not apply." But that's
not exactly true. Think about it this way (and, please, somebody correct
me if I've got this wrong). Suppose I have two identically distributed
populations, A and B. I build a model for population A, I score
population B, no problem. But suppose, instead of scoring all of B, I cut
away the tails. Then I may actually have decreased the average standard
error of prediction on my scored population, since predictions tend to be
most precise near the center of the modeling data! Even if I look at a
skewed subset of the B population, the standard errors of the group means
may inflate, but the standard error of each individual prediction will
remain the same. Where
you run into trouble is if the B population distribution falls
significantly outside of the A population distribution (sometimes you hear
this referred to as "out-of-sample").
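Here is a small sketch of that argument, assuming the model is a simple linear regression (my choice for illustration; the argument above is not tied to any particular model). It uses the textbook standard error of an individual prediction at x0, s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx), which depends only on the fitted model and on x0, not on which other records happen to be scored. The data and the function names fit_line, prediction_se and mean_se are made up.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; keep what the SE formula needs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {"n": n, "xbar": xbar, "sxx": sxx, "s2": sse / (n - 2)}

def prediction_se(fit, x0):
    """Standard error of an individual prediction at x0."""
    leverage = 1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"]
    return math.sqrt(fit["s2"] * (1 + leverage))

# Population A (modeling data): x = 0..10, y roughly 2x with made-up noise of +/-1.
xs_a = list(range(11))
ys_a = [2 * x + (1 if x % 2 else -1) for x in xs_a]
fit = fit_line(xs_a, ys_a)

def mean_se(xs):
    """Average individual-prediction standard error over a scoring population."""
    return sum(prediction_se(fit, x) for x in xs) / len(xs)

full_b = list(range(11))         # B distributed like A
trimmed_b = list(range(2, 9))    # B with the tails cut away
outside_b = list(range(20, 31))  # B well outside the range of A

print("mean SE, full B:   ", round(mean_se(full_b), 3))
print("mean SE, trimmed B:", round(mean_se(trimmed_b), 3))
print("mean SE, outside B:", round(mean_se(outside_b), 3))
```

The trimmed population scores with a smaller average standard error than the full one, while the population lying outside the range of A scores with a much larger one, which is the trouble case described above.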
-- TMK --
"The Macro Klutz"
On Wed, 21 Sep 2005 06:40:39 -0700, pa pa <ctll04@YAHOO.COM> wrote:
>In SAS Enterprise Miner, there is a clustering node. I would like to
>use it for classification. My dataset has a target variable with 5
>classes/values (A, B, C, D and E). I fed the dataset into the Cluster node.
>However, it only groups the inputs into clusters, and this grouping is not
>the same as the target attribute, which consists of 5 classes (I know
>clustering is unsupervised learning, and it doesn't need to know the
>labels of the training set). Particularly, there are more than 5
>Q1: How can I know which class (A, B, C, D, E) a cluster belongs to?
>Q2: Is it true that clustering techniques assume the distribution of
>A, B, C, D and E during training? So if the test set (after training is
>finished) does not follow this assumed distribution, the clustering
>technique will not work properly?
>Have a nice day