LISTSERV at the University of Georgia
Date:         Fri, 20 Aug 2004 17:42:02 -0400
Reply-To:     Richard Ristow <wrristow@mindspring.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Richard Ristow <wrristow@mindspring.com>
Subject:      Re: Selecting a sample
Comments: To: Marta García-Granero <biostatistics@terra.es>
In-Reply-To:  <4825967683.20040820172452@terra.es>
Content-Type: text/plain; charset="iso-8859-1"; format=flowed

At 11:24 AM 8/20/2004, Marta García-Granero wrote:

>A colleague wants to select a sample of 250 cases out of a dataset of
>several thousands cases and about ten variables, but with the condition that
>the mean & standard deviation of one of the variables (key variable)
>of the sample be [essentially] identical to the mean & standard
>deviation for that variable in the original dataset.
>
>This is not my field in SPSS, that's why I'm asking for help (worse
>moment of the week, BTW).

I actually doubt that this is anybody's field in SPSS. The question I see is, what constraints are you willing to put on the sample, to ensure the desired result?

Two quick thoughts:

A. Mean and standard deviation for that variable (and the others) will be equal to the population value IN EXPECTATION for a true random sample. That might be good enough. You might argue to your colleague that it IS good enough, given that any other procedure distorts the sample to some degree.
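To give a feel for point A, here is a minimal Python sketch (not SPSS syntax) that draws repeated simple random samples of 250 from a made-up population of 5000 values. The population, the seed, and the name srs_summary are all assumptions for illustration. Any single sample's mean and SD will wander a little, but averaged over many draws they sit close to the population values. (Strictly, it is the sample variance that is unbiased; the SD is only approximately so.)

```python
import random
import statistics

def srs_summary(values, n, rng):
    """Draw one simple random sample of size n (without replacement)
    and return its (mean, standard deviation)."""
    sample = rng.sample(values, n)
    return statistics.mean(sample), statistics.stdev(sample)

rng = random.Random(2004)  # fixed seed so the run is repeatable

# Hypothetical key variable: 5000 made-up, roughly normal values.
population = [rng.gauss(50.0, 10.0) for _ in range(5000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.stdev(population)

# 200 independent samples of 250 cases each.
draws = [srs_summary(population, 250, rng) for _ in range(200)]
avg_of_means = statistics.mean(m for m, _ in draws)
avg_of_sds = statistics.mean(s for _, s in draws)

print(f"population: mean={pop_mean:.2f} sd={pop_sd:.2f}")
print(f"avg sample: mean={avg_of_means:.2f} sd={avg_of_sds:.2f}")
```

Looking at the individual entries in draws shows how far any one sample can stray, which is the real question for your colleague.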

B. Here's a possible method. It retains the property that each case has equal probability of being selected; it loses the property that selection of each case is independent of which other cases are selected:

Order the data by the key variable. Group it into 250 subsets by key variable size -- i.e., 0.4 percentile groups. Select at random one member of each group. This will *probably* give quite close fit to population mean and SD (but this is an intuitive, not a calculated, judgement), and have the additional advantage of modelling the original distribution, not just its mean and SD.
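The procedure in B can be sketched as follows, again in Python rather than SPSS syntax, and again on made-up data. The function name one_per_quantile_group and the test population are assumptions, not anything from the original dataset. The sketch sorts by the key variable, cuts the sorted cases into 250 consecutive groups, and picks one case at random from each.

```python
import random
import statistics

def one_per_quantile_group(rows, key, n_groups, rng):
    """Sort rows by key, split them into n_groups consecutive groups
    of (nearly) equal size, and pick one row at random from each."""
    ordered = sorted(rows, key=key)
    total = len(ordered)
    if n_groups > total:
        raise ValueError("more groups than cases")
    sample = []
    for g in range(n_groups):
        start = g * total // n_groups       # first index of group g
        stop = (g + 1) * total // n_groups  # one past its last index
        sample.append(ordered[rng.randrange(start, stop)])
    return sample

rng = random.Random(820)  # fixed seed so the run is repeatable

# Hypothetical key variable: 5000 made-up, roughly normal values.
population = [rng.gauss(50.0, 10.0) for _ in range(5000)]
sample = one_per_quantile_group(population, key=lambda x: x,
                                n_groups=250, rng=rng)

print(f"population: mean={statistics.mean(population):.2f} "
      f"sd={statistics.stdev(population):.2f}")
print(f"sample:     mean={statistics.mean(sample):.2f} "
      f"sd={statistics.stdev(sample):.2f}")
```

Each case has exactly equal selection probability only when the number of cases is a multiple of 250; otherwise the groups differ in size by one case and the probabilities differ very slightly.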

Outliers, of course, can have a greatly disproportionate influence on the mean and SD of the set. If a very few large outliers affect the mean and SD, I'm not sure what you can do: including the outliers in the sample, and excluding them, will both probably give wrong results. Any analysis of a random subset of a dataset in which crucial variables are dominated by a few outliers is doubtful, though. (Jackknife methods?)

>(worst moment of the week, BTW)

(Wryly) I thought it was United States law that really difficult problems must be presented to a specialist who's finishing a long week, day, or shift. Has the European Union adopted this as well?

>Thanks, best regards and happy weekend to everybody

A happy weekend to you, too, Marta -- especially, happier than this!

-With respect and regards, Richard

