Date: Tue, 20 Sep 2005
ChiaoWen Hsiao
SPSSX(r) Discussion
Subject: Re: sample size for cluster analysis
Hector: Thank you so much for the suggestions. I tried running the analysis with different combinations of variables. All the analyses showed that our sample fell into three groups, and the three groups have meaningful and distinct differences in the variables of interest. In this case, I suppose we can use almost all of the variables, but we will need to choose carefully which ones to include based on the theory.
Joyce
>>> Hector Maletta 9/20/2005 5:17:33 PM >>>
There is no specific rule for this. In linear regression, a commonly cited
rule of thumb requires at least 10-20 cases per variable. Based on this rule,
with your 100 cases you should use a maximum of 5 variables, possibly
extensible to 10 variables.
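
The arithmetic behind those two figures can be sketched in a few lines of Python. This is only an illustration of the regression rule of thumb quoted above, not a rule specific to cluster analysis; the function name max_variables is made up for the example.

```python
def max_variables(n_cases, cases_per_variable):
    """Largest number of variables that still leaves the required
    number of cases per variable (hypothetical helper)."""
    return n_cases // cases_per_variable

n_cases = 100  # sample size mentioned in the thread

# Strict end of the rule: 20 cases per variable.
strict = max_variables(n_cases, 20)
# Lenient end of the rule: 10 cases per variable.
lenient = max_variables(n_cases, 10)

print(strict, lenient)
```

With 100 cases this gives 5 variables at the strict end and 10 at the lenient end, which is where the "maximum of 5, possibly extensible to 10" comes from.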
But it all depends on the variability among your cases. If your 100 cases
fall neatly within a few groups, and the variables are highly correlated
among themselves, then you may use more variables and still get meaningful
results (i.e. meaningful groups of cases). But if your cases are dispersed
across all values and combinations of values of the various variables, you
may as well form three clusters or thirty clusters, use four variables or
forty variables...
The general objective of a cluster analysis is to construct a few groups or
clusters that are (a) internally homogeneous and (b) clearly distinct from
other groups. If the cases are more or less evenly distributed all over
the variable space, many will fall in the "gray area", more or less at an
equal distance from various cluster centers, and thus attributing those
cases to one cluster or to another would be essentially arbitrary, and all
solutions would be highly unstable (changing even slightly the value of a
case in some of the variables would throw it into a different cluster). In
that kind of situation, larger samples (and larger cases/variables ratios)
would be needed.
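
One rough way to see how many cases sit in that "gray area" is to compare each case's distance to its nearest cluster center with its distance to the second-nearest one. The sketch below is only an illustration: the function name gray_area_cases, the 0.8 cutoff, and the toy data are all assumptions made up for the example, and it uses plain Euclidean distance on two variables.

```python
import math

def gray_area_cases(points, centers, threshold=0.8):
    """Return indices of cases that are almost equally far from their
    nearest and second-nearest centers (needs at least two centers).

    threshold is an arbitrary, hypothetical cutoff: a case is flagged
    when nearest / second_nearest >= threshold.
    """
    ambiguous = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, c) for c in centers)
        nearest, second = distances[0], distances[1]
        if second == 0 or nearest / second >= threshold:
            ambiguous.append(i)
    return ambiguous

# Toy data: two cluster centers and four cases on two variables.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (9.0, 1.0), (5.0, 0.5), (4.6, 0.0)]

print(gray_area_cases(points, centers))
```

Here the first two cases clearly belong to one center each, while the last two lie roughly midway and are flagged. If a large share of the real sample is flagged this way, the cluster assignment is likely to be unstable in the sense described above.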
Hector
> Original Message
> From: ChiaoWen Hsiao
> Sent: Tuesday, September 20, 2005 5:42 PM
> To: hmaletta@fibertel.com.ar
> Subject: RE: sample size for cluster analysis
>
> What is the general rule of thumb for determining sample size
> in cluster analysis? Are there any books/articles out there
> that you would recommend? Thanks.
>
> I am trying to reduce the number of variables based on our
> theory. What would be the acceptable number? Is 12 variables
> acceptable?
>
> Thank you so much!
>
> Joyce
>
> >>> "Hector Maletta" 9/20/2005 4:29:13 PM >>>
> It is probably too small a sample, and probably (even for a somewhat
> larger sample) 20 is too many clustering variables. Most probably some
> of the 20 are strongly correlated with other variables in the set, and
> thus redundant.
> Try to think of a few of the most essential variables you
> want to classify cases by (perhaps one for each conceptual
> dimension you are trying to cover), and run cluster analysis
> with those variables only.
>
> Using factor analysis to reduce your 20 variables to a few
> underlying factors, and then using the resulting factor scores
> for the clustering, may also in theory be a solution, but 100
> cases are also too few for factor analysis of 20 variables.
>
> Hector
>
>
>
> > Original Message
> > From: SPSSX(r) Discussion On Behalf Of ChiaoWen Hsiao
> > Sent: Tuesday, September 20, 2005 5:02 PM
> > To: SPSSXL@LISTSERV.UGA.EDU
> > Subject: sample size for cluster analysis
> >
> > Hi all,
> >
> > I am running a cluster analysis with 20 variables in a sample of 100
> > participants. Is the sample size too small? Should I try to reduce the
> > number of variables? This is my first time running cluster analysis.
> > Any help would be greatly appreciated! Thank you!!
> >
> > Joyce
> >
