Date: Wed, 2 Jul 2003 10:47:19 -0700
Reply-To: Bin <bztt@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Bin <bztt@MSN.COM>
Organization: http://groups.google.com/
Subject: clustering with known class
Content-Type: text/plain; charset=ISO-8859-1
Hi, all,
I hope this is the right list; if not, I am sorry to bother you.
My task is to cluster a set of protein sequences (for example, 3000
sequences, for which I can compute the distance matrix). Some of the
sequences (1000) belong to known classes (protein folds); the others
(2000) are of unknown class. This is not a supervised classification
problem, because there are many new classes among the sequences.
My question is: how can I take advantage of the known classes to
determine the number of classes?
My plan is to run a systematic series of trials, clustering all 3000
sequences into different numbers of classes (200, 300, ...), and to
calculate the purity or entropy of the known classes for each trial, so
that I can choose the clustering result with a reasonable
purity/entropy.
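
A minimal sketch of the purity/entropy calculation described above, in
Python. The function names (group_known, purity, entropy, sweep) are only
illustrative. The cluster ids are assumed to come from whatever clustering
is run on the distance matrix, and sequences of unknown class are assumed
to be marked with None so that they are skipped.

```python
import math
from collections import Counter, defaultdict


def group_known(cluster_ids, known_labels):
    """Group known class labels by cluster id; None labels are skipped."""
    groups = defaultdict(list)
    for cid, label in zip(cluster_ids, known_labels):
        if label is not None:
            groups[cid].append(label)
    if not groups:
        raise ValueError("no sequences with a known class")
    return groups


def purity(cluster_ids, known_labels):
    """Fraction of known sequences that sit in their cluster's majority class."""
    groups = group_known(cluster_ids, known_labels)
    total = sum(len(g) for g in groups.values())
    majority = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return majority / total


def entropy(cluster_ids, known_labels):
    """Size-weighted average of per-cluster class entropy, in bits."""
    groups = group_known(cluster_ids, known_labels)
    total = sum(len(g) for g in groups.values())
    h = 0.0
    for g in groups.values():
        n = len(g)
        h_cluster = -sum((c / n) * math.log2(c / n) for c in Counter(g).values())
        h += (n / total) * h_cluster
    return h


def sweep(assignments_by_k, known_labels):
    """Score each trial; assignments_by_k maps a cluster count to cluster ids."""
    return [
        (k, purity(ids, known_labels), entropy(ids, known_labels))
        for k, ids in sorted(assignments_by_k.items())
    ]


# Toy example: six sequences, the last one of unknown class.
labels = ["a", "a", "b", "b", "b", None]
trials = {1: [0, 0, 0, 0, 0, 0], 2: [0, 0, 0, 1, 1, 1]}
for k, p, h in sweep(trials, labels):
    print(k, round(p, 3), round(h, 3))
```

One caveat with this kind of sweep: purity can only improve as the number
of clusters grows (with one sequence per cluster it is trivially 1.0), so
picking the maximum is not enough; looking for the point where the curve
levels off may be more useful.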
I would appreciate any suggestions.
Bin