LISTSERV at the University of Georgia
Date:         Wed, 2 Sep 1998 17:47:03 -0300
Reply-To:     Emiliano Maletta <hmaletta@OVERNET.COM.AR>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU>
From:         Emiliano Maletta <hmaletta@OVERNET.COM.AR>
Subject:      Re: weighting data
Comments: To: "Popovic, Jennifer, VBAVACO" <ormjpopo@VBA.VA.GOV>
Content-Type: text/plain; charset=iso-8859-1

Dear Jennifer,

I don't know whether I can answer all your questions. Take this only as a few comments. For the sake of clarity, let:

N  = Total population size
Ni = Population group i
n  = Total planned sample (700 x 57)
ni = Planned sample group i (700 cases, you say)
qi = Effective sample interviewed in group i (average about 65% of the 700 cases, varying between groups)
q  = Total effective sample interviewed across all groups (about 0.65 x 700 x 57)

You extract ni cases from population Ni by random procedures. Let us assume it is a SRS (if it is not, its error would depend on the sample design). The question now is: is nonresponse randomly distributed within each group, or is it associated with some known variable such as age, gender, socioeconomic status or the like?

a) If nonresponse is random within each group, then qi would be a SRS of Ni and you could proceed on that assumption. Weights would be Ni/qi (the same weight for all individuals in the group).

b) If nonresponse is somewhat selective and is a function of X, W, Z, and moreover you know the distribution of X, W, Z within each group in the population (say, from census data), you could use specific weights for each sub-subgroup, Nijkm/qijkm, where jkm is a combination of values of X, W, Z (e.g. male, 30-39 years old, lower status). You could test the randomness of nonresponse patterns by comparing profiles of respondents with population profiles (assuming you know which variables X, W, Z are relevant, and have data on all of them). Notice that you would never know whether your subset of effective respondents is somehow biased in your variable of interest, as compared with the overall population of the group they come from, for even census data are not likely to contain the question you are asking. You could only ascertain whether they are similar or different in terms of background variables such as gender, age, education and the like. If these background variables are good predictors of your dependent variable, you can live with this.

c) If nonresponse is not randomly distributed, and you are not able to 'explain' the nonresponse pattern by means of any set of explanatory variables X, W, Z, then you have a problem. There is some literature on the subject of statistical analysis with missing data (see Little & Rubin's book of that title, Wiley & Sons) and some on nonresponse too (though these refuse to come to my mind right now).
I will not delve further into this alternative.

Now, turning to statistical significance: using the aforementioned weights, your estimates for the whole population of the 57 groups taken together would have an error distribution which may no longer correspond to a SRS of size q. This is so because you are using a stratified sample (57 strata), and (if you considered the X, W, Z variables) even more than that number of strata, since each possible combination (group i, gender j, age group k and status m) would be a stratum. The standard error for a stratified sample is generally smaller than the corresponding error in a simple random sample of the same size. Apart from this, you may be using some other trick of sample design within each group (remember we ASSUMED each group sample was simple random), which would further complicate the case.
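
Going back to the weights in cases (a) and (b) above, here is a minimal Python sketch of the arithmetic. All counts are hypothetical figures made up for illustration, and the function name is my own; the only thing taken from the discussion is the rule weight = Ni/qi (or Nijkm/qijkm per cell).

```python
def expansion_weights(pop_counts, resp_counts):
    """Return population count / respondent count for every group or cell."""
    weights = {}
    for key, pop in pop_counts.items():
        resp = resp_counts.get(key, 0)
        if resp == 0:
            # A cell with no respondents cannot be weighted; it has to be
            # merged with a neighbouring cell before weights are computed.
            raise ValueError(f"no respondents in {key!r}")
        weights[key] = pop / resp
    return weights

# Case (a): one weight per group, Ni / qi (hypothetical counts).
population = {"group_1": 12000, "group_2": 3000}   # Ni
respondents = {"group_1": 480, "group_2": 400}     # qi
group_w = expansion_weights(population, respondents)

# Case (b): one weight per (group, gender, age band, status) cell.
cell_population = {
    ("group_1", "male", "30-39", "lower"): 900,
    ("group_1", "female", "30-39", "lower"): 1000,
}
cell_respondents = {
    ("group_1", "male", "30-39", "lower"): 30,
    ("group_1", "female", "30-39", "lower"): 50,
}
cell_w = expansion_weights(cell_population, cell_respondents)

# The weighted respondents add up to the population they stand for.
expanded_total = sum(group_w[g] * respondents[g] for g in respondents)
```

The same function serves both cases because only the key changes: a group label in case (a), a tuple of group and background variables in case (b).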

SPSS computes statistical significance for the WEIGHTED data set. If your weights are of the form N/q or N/n, the resulting figures would be of the same order of magnitude as the total population (say, millions of people) even if your sample is only a few hundred or a few thousand cases. Therefore, if you apply those weights and ask for statistical significance tests or confidence intervals, SPSS would be assuming your 'sample' is of size N, not of size q or n. Consequently, it would grossly overstate the level of significance and grossly understate the error of the estimation.
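
To get a feeling for the size of the distortion, here is a small Python sketch comparing the standard error of a proportion for a SRS of size q with the one obtained when the weighted total N is mistaken for the sample size. The proportion and the population total are hypothetical; q is the approximate figure 0.65 x 700 x 57 mentioned above.

```python
import math

def se_proportion(p, n):
    """Standard error of a proportion under simple random sampling of size n."""
    return math.sqrt(p * (1.0 - p) / n)

p_hat = 0.40        # hypothetical estimated proportion
q = 25_935          # about 0.65 x 700 x 57 respondents
N = 2_500_000       # hypothetical population total

se_true = se_proportion(p_hat, q)       # error for a SRS of size q
se_inflated = se_proportion(p_hat, N)   # error if N is treated as the sample size
ratio = se_true / se_inflated           # equals sqrt(N / q)
```

The reported error shrinks by a factor of sqrt(N/q), so confidence intervals become narrower by that same factor.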

If your sample could be considered as a SRS, you could create a set of weights that preserve the overall sample size while still reflecting the different selection probabilities. This is achieved by multiplying the original weights wi = Ni/qi (or Nijkm/qijkm) by q/N, where N is the total population size and q the total sample size. The resulting weights would be zi = (Ni/qi)*(q/N). This way, all your frequency tables would total q cases (not N cases), but each individual case may count as somewhat less or somewhat more than 1, depending on its selection probability being respectively higher or lower than the average. The significance and confidence intervals computed by SPSS would be accurate for a SRS of size q, which is what you have.
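
A minimal Python sketch of the weights zi = (Ni/qi)*(q/N), again with hypothetical group counts and a function name of my own choosing:

```python
def normalized_weights(pop_counts, resp_counts):
    """Return zi = (Ni / qi) * (q / N) for every group."""
    N = sum(pop_counts.values())    # total population
    q = sum(resp_counts.values())   # total effective sample
    return {g: (pop_counts[g] / resp_counts[g]) * (q / N) for g in pop_counts}

population = {"A": 12000, "B": 3000}   # Ni (hypothetical)
respondents = {"A": 480, "B": 400}     # qi (hypothetical)

z = normalized_weights(population, respondents)

# Weighted case count: it should come back to q = 880, not N = 15000.
weighted_total = sum(z[g] * respondents[g] for g in z)
```

Group B was sampled at a higher rate than average (400 out of 3000), so its cases count for less than 1; group A was sampled at a lower rate, so its cases count for more than 1.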

If your sample is a stratified one, its error would be less than that. If you can estimate how large a SRS would have to be to yield the same standard error as your stratified sample, say h instead of q, then you could correct your weights by multiplying them by h/q. Tables would show a total count of h cases, and significance levels computed by SPSS would correspond to a SRS of size h, which is in turn equivalent to a stratified sample of size q.
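
One way to estimate h, sketched in Python under stated assumptions: the variable of interest is a proportion, each stratum is a SRS, the finite population correction is ignored, and the strata shares, proportions and sizes are hypothetical. The ratio of the stratified variance to the SRS variance is the usual design effect, and h = q / design effect. Note that this ratio depends on the variable analysed, and that with very unequal sampling rates across strata it can come out above 1, in which case h would be smaller than q.

```python
def stratified_variance(shares, props, sizes):
    """Variance of a stratified estimate of a proportion (no finite population correction)."""
    return sum(w * w * p * (1.0 - p) / n for w, p, n in zip(shares, props, sizes))

shares = [0.5, 0.5]   # Ni / N for each stratum (hypothetical)
props = [0.2, 0.8]    # proportion observed in each stratum (hypothetical)
sizes = [500, 500]    # qi, respondents per stratum (hypothetical)

q = sum(sizes)
p_overall = sum(w * p for w, p in zip(shares, props))
var_srs = p_overall * (1.0 - p_overall) / q          # SRS of the same size q
deff = stratified_variance(shares, props, sizes) / var_srs
h = q / deff                                         # equivalent SRS size

# Weights that preserve the sample size (all 1 here, because this example
# uses proportional allocation), then corrected by h / q.
z = [1.0] * q
z_corrected = [w * h / q for w in z]
```

In this example the strata differ a lot on the variable of interest, so stratification pays off: the design effect is 0.64 and the 1000 cases are worth about 1562 simple random ones.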

Stratification reduces sampling error, but clustering increases it. If you have some clustering within your groups (e.g. you randomly choose some particular districts or wards or areas or schools and then obtain your group samples by random selection within the selected clusters), your sampling error is correspondingly higher. You may estimate the overall effect of this and proceed likewise: estimate the size p of a SRS yielding the same error as your effective (clustered) sample, and do as instructed in the preceding paragraph. Of course, stratification and clustering could be present simultaneously.
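
For the clustered case, a rough Python sketch using Kish's well-known approximation, design effect = 1 + (m - 1) * rho, where m is the average number of respondents per cluster and rho the intraclass correlation of the variable of interest. Both m and rho below are hypothetical values, and this approximation is my choice for illustration, not something prescribed in the discussion above.

```python
def clustering_design_effect(avg_cluster_size, intraclass_corr):
    """Kish's approximation: deff = 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * intraclass_corr

q = 25_935     # effective sample, about 0.65 x 700 x 57
m = 20         # hypothetical average respondents per cluster
rho = 0.05     # hypothetical intraclass correlation

deff = clustering_design_effect(m, rho)
p_equiv = q / deff   # size p of the SRS with the same error
```

With these figures the clustered sample of 25,935 respondents carries about as much information as a SRS of some 13,300 cases, so the weights would be multiplied by p/q, roughly 0.51.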

Of course, an exact computation of sampling errors for a complex sampling design is NOT the same as computing the error of a hypothetical SRS of a different size but equivalent error, if only because the equivalence is not easily estimated. But this way you could use SPSS to compute significance in a fashion that may overstate or understate your significance levels but (ordinarily) not by much.

Remember, finally, that you had better obtain your frequency tables (or any statistical output where absolute totals are relevant) using weights that expand the sample up to the effective population size N, but for significance levels you should use weights corrected to your effective sample size or to the hypothetical SRS having the same error as yours. Thus, if you ask for a crosstabulation expanded to N cases, the chi-square probability accompanying the crosstabulation in SPSS will not be valid; conversely, when you weight without expanding to N cases, your chi-square probability would be right but the table would appear to refer to just q (or h) cases. Therefore, you would have to run the procedure twice, once for the table and once for the coefficients, using the appropriate weights each time.
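
The effect on chi-square can be seen with a few lines of Python. This sketch computes the plain Pearson chi-square (no continuity correction) for a hypothetical 2 x 2 table, first with sample-sized counts and then with the same counts expanded by a constant factor of 100. It assumes the weighted cell counts are treated as if they were observed counts, which is the behaviour described above.

```python
def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

sample_table = [[30, 20], [20, 30]]   # hypothetical counts at sample scale
expansion = 100                       # hypothetical constant expansion factor
expanded_table = [[c * expansion for c in row] for row in sample_table]

chi_sample = chi_square(sample_table)       # 4.0
chi_expanded = chi_square(expanded_table)   # 400.0
```

The percentages in both tables are identical, yet the statistic is multiplied by the expansion factor, which is why the table and the test need separate runs with different weights.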

Recall, also, that these tricks using SPSS are not exactly orthodox. SPSS assumes all samples to be simple random, thus all we can hope for is not to be wrong by much. To compute sampling errors from complex sampling designs you simply need different software (such as WesVar).

Hope this helps.

Hector Maletta
Universidad del Salvador
Buenos Aires, Argentina

Popovic, Jennifer, VBAVACO wrote:

> Hector,
>
> I am a member of the SPSS discussion list and am writing to you because
> I remember you discussing weighted data with a few persons on the list a
> couple of months ago. The explanations you gave were wonderful, and I
> wondered if you might be able to answer a question for me!
>
> I've gathered data from 57 different subgroups of a population. Some of
> these subgroups comprise more of the pop than others, but because I
> wanted to talk about each subgroup on its own, I surveyed about 700
> people from each subgroup. Now I want to roll the data up and talk
> about the entire pop (all subgroups together); I know I need to apply a
> weight to the data that is the inverse of each subgroup's probability of
> selection so that the larger subgroups are counted more than the smaller
> subgroups.
>
> If I also wanted to weight the data to correct for nonresponse (I had
> about a 65 percent response rate), I know I'd have to compute a weight
> that was equaled to [(subpopulation N/pop N)/(sample subpop/sample N)].
> This would weight my sample according to known population distributions
> on whatever key variable(s) I chose.
>
> Now comes my question: If I wanted a weight that would correct for
> unequal probability of selection and nonresponse, would I just multiply
> these two weights together?
> And at that point, my point estimates would be accurate, but confidence
> intervals would not be and I would not be able to do significance tests
> with my data anymore because my stat package (SPSS) would be assuming a
> SRS design, which mine is not, so my standard errors would in fact be
> smaller than what SPSS is actually showing me?
>
> Any advice you could give would be appreciated! Thank you...
>
> Jennifer R. Popovic
> Survey Statistician
> Surveys and Research Staff
> Veterans Benefits Administration
> ormjpopo@vba.va.gov

