Date: Fri, 20 Aug 2004 17:42:02 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: Selecting a sample
In-Reply-To: <4825967683.20040820172452@terra.es>
Content-Type: text/plain; charset="iso-8859-1"; format=flowed
At 11:24 AM 8/20/2004, Marta García-Granero wrote:
>A colleague wants to select a sample of 250 cases out of a dataset of
>several thousand cases and about ten variables, but with the condition
>that the mean & standard deviation of one of the variables (key variable)
>of the sample be [essentially] identical to the mean & standard
>deviation for that variable in the original dataset.
>
>This is not my field in SPSS, that's why I'm asking for help (worst
>moment of the week, BTW).
I actually doubt that this is anybody's field in SPSS. The question I
see is, what constraints are you willing to put on the sample, to
ensure the desired result?
Two quick thoughts:
A. Mean and standard deviation for that variable (and the others) will
be equal to the population values IN EXPECTATION for a true random
sample (exactly so for the mean and the variance; very nearly so for
the SD at n=250). That might be good enough. You might argue to your
colleague that it IS good enough, given that any other procedure
distorts the sample to some degree.
B. Here's a possible method. It retains the property that each case has
equal probability of being selected; it loses the property that
selection of each case is independent of which other cases are
selected:
Order the data by the key variable. Group it into 250 equal-sized
subsets of consecutive cases -- i.e., 0.4-percentile groups. Select at
random one member of each group. (Equal group sizes are what keep the
selection probabilities equal.) This will *probably* give quite a close
fit to the population mean and SD (but this is an intuitive, not a
calculated, judgement), and have the additional advantage of modelling
the original distribution, not just its mean and SD.
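
To make the two thoughts concrete, here is a minimal sketch in Python rather than SPSS syntax. Everything in it is an assumption for illustration: the function names, the seed, and the made-up key variable (5000 roughly normal values). It draws one plain random sample (thought A) and one sample with one case per group (thought B), and prints how far each sample's mean and SD fall from the full-data values.

```python
import random
import statistics


def simple_random_sample(values, n, rng):
    """Thought A: plain random sample without replacement."""
    return rng.sample(values, n)


def one_per_quantile_group(values, n, rng):
    """Thought B: sort, cut into n consecutive groups, draw one case per group."""
    ordered = sorted(values)
    size = len(ordered)
    # Group i covers ordered[i*size//n : (i+1)*size//n]; sizes differ by at most 1.
    return [
        rng.choice(ordered[i * size // n:(i + 1) * size // n])
        for i in range(n)
    ]


rng = random.Random(2004)  # fixed seed so the run is repeatable
# Hypothetical key variable: 5000 roughly normal values (not real data).
population = [rng.gauss(50, 10) for _ in range(5000)]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

for name, draw in [("simple random", simple_random_sample),
                   ("one per group", one_per_quantile_group)]:
    sample = draw(population, 250, rng)
    # Print the gap between sample and full-data mean, then SD.
    print(name,
          round(statistics.mean(sample) - pop_mean, 3),
          round(statistics.stdev(sample) - pop_sd, 3))
```

With 5000 cases each group holds exactly 20, so every case has the same 1-in-20 chance. If the case count is not a multiple of 250, the group sizes differ by one and the selection probabilities are only approximately equal.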
Outliers, of course, can have a greatly disproportionate influence on
the mean and SD of the set. If a very few large outliers affect the
mean and SD, I'm not sure what you can do: including the outliers in
the sample, and excluding them, will both probably give wrong results.
Any analysis of a random subset of a dataset in which crucial variables
are dominated by a few outliers is doubtful, though. (Jackknife
methods?)
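
A small Python sketch of why both choices go wrong, using invented numbers (4995 ordinary cases between 40 and 60, plus 5 cases at 5000; none of this comes from the actual dataset). A 250-case sample can only hold a whole number of outliers, while the proportional share would be a quarter of one case.

```python
import statistics

# Hypothetical data: 4995 ordinary cases plus 5 extreme ones (illustrative only).
ordinary = [40 + (i % 21) for i in range(4995)]   # values 40..60
outliers = [5000.0] * 5
population = ordinary + outliers

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# The "fair share" of outliers in 250 cases is 250 * 5 / 5000 = 0.25 of a case,
# which no sample can contain: it holds either none or at least one.
sample_without = ordinary[:250]
sample_with_one = ordinary[:249] + [5000.0]

for label, sample in [("no outlier", sample_without),
                      ("one outlier", sample_with_one)]:
    # Print sample mean and SD next to the full-data values.
    print(label,
          round(statistics.mean(sample), 2), round(pop_mean, 2),
          round(statistics.stdev(sample), 2), round(pop_sd, 2))
```

In this made-up case the sample without an outlier undershoots the full-data mean and SD, and the sample with a single outlier overshoots both.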
>(worst moment of the week, BTW)
(Wryly) I thought it was United States law that really difficult
problems must be presented to a specialist who's finishing a long week,
day, or shift. Has the European Union adopted this as well?
>Thanks, best regards and happy weekend to everybody
A happy weekend to you, too, Marta -- especially, happier than this!
-With respect and regards,
Richard