LISTSERV at the University of Georgia
Date:         Wed, 28 Jul 2010 11:57:34 -0500
Reply-To:     Joe Matise <snoopy369@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Joe Matise <snoopy369@GMAIL.COM>
Subject:      Re: Proc Surveyselect and PPS
Comments: To: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@dshs.wa.gov>
In-Reply-To:  <941871A13165C2418EC144ACB212BDB0018C01DF@dshsmxoly1504g.dshs.wa.lcl>
Content-Type: text/plain; charset=ISO-8859-1

Thanks, Dan. I am double-checking with the folks I'm writing this for, but I think they need the sample to be complete: if I want 12000 records and can only pull 11400 records that fit the targets exactly, they still want 12000 records, matched as closely as possible. I think your method could work if I preprocess the rates with a PROC FREQ of the dataset and adjust them to fit the necessary criteria.
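
A minimal sketch of that preprocessing step, assuming the from_class and stratumsize datasets from Dan's quoted code below (both already sorted by _stratum). The names available, stratumsize_capped, and shortfall are made up for illustration, and the sketch only caps and reports the gap; it does not redistribute the missing records to other strata.

```sas
/* Count the records actually available in each stratum */
proc freq data=from_class noprint;
  tables _stratum / out=available(keep=_stratum count);
run;

/* Cap each requested stratum size at the available count */
/* and write any shortfall to the log                     */
data stratumsize_capped(keep=_stratum _nsize_);
  merge stratumsize(in=want) available(in=have);
  by _stratum;
  if want and have;
  shortfall = max(_nsize_ - count, 0);
  _nsize_ = min(_nsize_, count);
  if shortfall > 0 then
    put 'Stratum ' _stratum 'is short by ' shortfall;
run;
```

The capped dataset could then be passed as n=stratumsize_capped in the PROC SURVEYSELECT step. The SELECTALL option Dan mentions is the built-in way to take every record in an undersized stratum; the point of doing it by hand is to see how large the shortfall is and where it falls.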

I think the ultimate goal of the account group is to achieve a full sample (12000, in this example) with a balanced representation of potential respondents. It's not intended to have a certain number of people in each bucket, I don't think, so much as to have a 'balanced' representation of people in the entire panel. Usually we are balancing to census numbers here, i.e., census proportions of age/sex/region/household income/etc., in order to have a balanced panel.

The weighting method does come much closer to the requested size; for example, in the below example adjusted to 12000 instead of 1200, your method gets 11240 while using rimweights gets 11742, and gets the frequencies much closer. I also have five criteria in the actual sample, not two; I'm not sure what effect that would have on the effectiveness of the strata (I'm pulling a large enough sample that all of the strata, even with 5 criteria, still have a good number of records in them). I will tailor your method to five criteria to see if it runs into any problems (primarily adding three more tables into the join, if I understand correctly).
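
As a rough sketch of what adding more tables into the join could look like, here is the rates step from Dan's quoted code extended to a third margin. The f_region dataset and its percentages are hypothetical, f_age and f_sex are the datasets from the quoted messages, and multiplying the marginal percentages assumes the criteria are independent of one another, which real panel data may not satisfy.

```sas
/* Hypothetical third margin, for illustration only */
data f_region;
  input region $ percent;
  datalines;
NE 20
MW 25
S 35
W 20
;;;;
run;

/* Cross all margins; each cell's rate is the product of its marginal rates */
proc sql;
  create table rates as
  select a.age, b.sex, c.region,
         (a.percent/100) * (b.percent/100) * (c.percent/100) as rate
  from f_age as a, f_sex as b, f_region as c
  ;
quit;
```

Each further criterion adds one more table to the FROM clause and one more factor to the product, so the number of cells grows multiplicatively; the stratum key and the _nsize_ calculation would need to be extended in the same way.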

If I understand how PROC SURVEYSELECT works, it is actually doing roughly what I'm doing: it adjusts the probability of selection so that records are selected in the right proportions. So there's not really a statistical difference between the two approaches, just that the rimweighting uses several passes through the data while SURVEYSELECT uses a single set of buckets. Interestingly enough, the resulting frequencies are quite different. My method is about equally short of females and males (proportionally) and is short only of 14yo, while your method is short of 13 and 14yo, and the shortfall is entirely males:

My method:

age   Frequency   Percent   Cum. Freq.   Cum. Pct.
 11        1199     10.21         1199       10.21
 12        1200     10.22         2399       20.43
 13        2401     20.45         4800       40.88
 14        3341     28.45         8141       69.33
 15        2400     20.44        10541       89.77
 16        1201     10.23        11742      100.00

sex   Frequency   Percent   Cum. Freq.   Cum. Pct.
  F        7037     59.93         7037       59.93
  M        4705     40.07        11742      100.00

Your method:

age   Frequency   Percent   Cum. Freq.   Cum. Pct.
 11        1200     10.68         1200       10.68
 12        1200     10.68         2400       21.35
 13        1962     17.46         4362       38.81
 14        3278     29.16         7640       67.97
 15        2400     21.35        10040       89.32
 16        1200     10.68        11240      100.00

sex   Frequency   Percent   Cum. Freq.   Cum. Pct.
  F        7200     64.06         7200       64.06
  M        4040     35.94        11240      100.00

The rimweighting gets a bit closer if I ask it to be more precise, as I found out when I tested it (it's not solely the inadequate sample size). It definitely is getting closer to the desired results, though - 60% F, 40% M, etc., despite being short.
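
On the selection-probability point: one way to see where without-replacement PPS runs out of room is to compute each record's expected inclusion probability, n * wgt / sum(wgt), and look at which cells exceed 1. This is only a sketch, assuming out_class is the raked dataset from the quoted code below with the weight in wgt; incl_prob and incl_p are names invented here.

```sas
%let sampsize = 12000;   /* desired sample size, as in the thread */

/* Expected inclusion probability under PPS: n * size / total size. */
/* PROC SQL remerges sum(wgt) back onto every row.                  */
proc sql;
  create table incl_prob as
  select recid, age, sex, wgt,
         &sampsize * wgt / sum(wgt) as incl_p
  from out_class;
quit;

/* Cells whose records would need a selection probability above 1 */
proc freq data=incl_prob;
  where incl_p > 1;
  tables age*sex / norow nocol nopercent;
run;
```

Any cell that shows up here cannot be sampled at its full target rate without selecting some record more than once, which would be consistent with the shortfalls in the tables above.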

Thanks,

Joe

On Wed, Jul 28, 2010 at 3:52 AM, Nordlund, Dan (DSHS/RDA) <NordlDJ@dshs.wa.gov> wrote:

> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Joe Matise
> > Sent: Tuesday, July 27, 2010 4:16 PM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Proc Surveyselect and PPS
> >
> > Hi folks,
> > I'm helping someone in my company develop a sampling program that will
> > select balanced samples according to various categories - ie, rather than
> > have specific strata cell sizes, we want to set up overall proportion
> > targets for each strata, but not specify the cell sizes directly. IE, we
> > might be asked that a sample be 55% female, 45% male, certain percentages
> > in each age group, certain percentages from each census region, and
> > certain percentages from each income category.
> >
> > I thought it would be most efficient to first rimweight the sample, and
> > then select a sample based on SIZE=(rimweight) from that dataset, using
> > PPS. Sampling without replacement is necessary, I believe, as we don't
> > want to select people twice.
> >
> > I however haven't been able to convince PROC SURVEYSELECT to use regular
> > PPS. I obviously misunderstand something about CERTSIZE= or MAXSIZE=, or
> > am just missing the boat entirely. What I currently do is something like
> > this:
> >
> > data from_class;
> >   set sashelp.class;
> >   if _n_ = 1 then _r=1;
> >   x = floor(ranuni(7)*1500);
> >   do _t = 1 to x;
> >     recid=_r;
> >     _r+1;
> >     output;
> >   end;
> >   drop _:;
> > run;
> >
> > data f_age;
> >   input age percent;
> >   datalines;
> > 11 10
> > 12 10
> > 13 20
> > 14 30
> > 15 20
> > 16 10
> > ;;;;
> > run;
> > data f_sex;
> >   input sex $ percent;
> >   datalines;
> > M 40
> > F 60
> > ;;;;
> > run;
> >
> > options nonotes nomprint nosymbolgen nomlogic;
> > %rakinge( /* call enhanced raking */
> >   inds=from_class,
> >   outds=out_class,
> >   inwt= ,
> >   freqlist=F_age f_sex,
> >   outwt=wgt,
> >   byvar=,
> >   varlist = age sex,
> >   numvar=2,
> >   cntotal=18701,
> >   trmprec=1,
> >   trmpct=,/*0.0001, macro will terminate based on this criterion */
> >   numiter=250,
> >   prdiag=
> > );
> > options notes;
> >
> > proc surveyselect data=out_class sampsize=12000 method=pps_seq
> >   out=selected_pass1;
> > *proc surveyselect data=out_class sampsize=12000 method=pps_seq
> >   out=selected_pass1 certsize=1.55;
> > *this does not work, nor does using maxsize=1.55 - which is (total
> >   size)/(samp size) or (18701/12000).;
> >   size wgt;
> > run;
> > proc freq data=selected_pass1;
> >   tables age sex;
> > run;
> >
> > (Using the rakinge macro originally created by David Izrael and found in
> > posts on SAS-L).
> >
> > Now, with an appropriate sampsize, say, 90, it works fine - there are
> > enough of everything to get 90, no problem. However, with a sampsize that
> > is higher, say 120, it refuses - presumably because it cannot find enough
> > to get that 120 and still hold to the requirements. Some records would
> > have a sampling probability greater than 1, and that causes it to fail.
> > Using CERTSIZE or MAXSIZE doesn't seem to help (setting it to (total
> > count)/(sample desired) or even smaller numbers still fails).
> >
> > If I use PPS_SEQ or PPS_SYS, it works, but underdraws the sample (as it's
> > using replacement). One option I suppose would be to do this and then
> > repull additional records to make up the difference, though I'm not
> > entirely sure of the best way to do this while still maintaining as close
> > as possible to the weighted targets. I could just rerun PROC SURVEYSELECT,
> > but that generates a bit of risk - both that I might end up doing this
> > five or six times, and that I do not get particularly identical
> > proportions.
> >
> > For example:
> >
> > proc sort data=selected_pass1 out=selected_merge(keep=recid);
> >   by recid;
> > run;
> > proc sort data=out_class;
> >   by recid;
> > run;
> >
> > data for_pass2;
> >   merge out_class(in=a) selected_merge(in=b);
> >   by recid;
> >   if not b;
> > run;
> >
> > proc sql;
> >   select 12000 - count(1) into :topull from selected_pass1;
> > quit;
> >
> > proc surveyselect data=for_pass2 out=selected_pass2 sampsize=&topull
> >   method=pps;
> >   size wgt;
> > run;
> >
> > data selected_all;
> >   set selected_pass1 selected_pass2;
> > run;
> > proc freq data=selected_all;
> >   tables age sex;
> > run;
> >
> > That gets me only 27% 14 year olds. I do have enough to get close at
> > least - I have 3431 14 year olds, and only 3234 are being pulled - so
> > clearly the weighting at this point just fails, because it's no longer
> > particularly accurate. I can reweight, but then I'm very likely to have to
> > do this repeatedly, as I will again have sample sizes that are not
> > appropriate.
> >
> > Is there a better solution than rimweighting and then using PPS of some
> > variety? Am I missing something on the correct use of MAXSIZE and/or
> > CERTSIZE?
> >
> > Thanks,
> >
> > Joe
>
> Joe,
>
> I think what you want is actually stratified sampling (method=srs with the
> strata statement). I don't have the time to figure out the raking macro, so
> I modified your data generation slightly by adding data for age=16, sex=F.
> I used your two datasets specifying age and sex percentages to create the
> rate for sampling each stratum, and then calculated stratum size using the
> rates and the total desired sample size. Then created a stratumsize dataset
> to use with surveyselect. I presume the raking macro will either calculate
> stratum size or sampling rate which can be substituted appropriately in my
> process. (I only created the _stratum variable because I couldn't get
> surveyselect to work with age and sex directly and I am too tired to figure
> it out tonight.) If your calculated stratum size exceeds the available
> case, you can tell surveyselect to take all the obs in the stratum by
> uncommenting the selectall option below. So, something like this should
> work.
>
> data from_class;
>   set sashelp.class;
>   length _stratum $3;
>   _stratum = put(age,2.)||sex;
>
>   if _n_ = 1 then _r=1;
>   x = floor(ranuni(7)*1500);
>   do _t = 1 to x;
>     recid=_r;
>     _r+1;
>     output;
>   end;
>   **--adding records for age=16, sex=F--**;
>   if name EQ 'Barbara' then do _t = 1 to x;
>     name = 'Babs';
>     age=16;
>     _stratum = put(age,2.)||sex;
>     recid=_r;
>     _r+1;
>     output;
>   end;
>   drop _r _t;
> run;
>
> data f_age;
>   input age percent;
>   datalines;
> 11 10
> 12 10
> 13 20
> 14 30
> 15 20
> 16 10
> ;;;;
> run;
> data f_sex;
>   input sex $ percent;
>   datalines;
> M 40
> F 60
> ;;;;
> run;
>
> **--determine wanted stratum sampling rates--**;
> proc sql;
>   create table rates as
>   select a.age, a.percent/100 as a_pct, b.sex, b.percent/100 as b_pct
>   from f_age as a, f_sex as b
>   ;
> quit;
>
> **--create secondary dataset with sample stratum sizes--**;
> %let total_sample_size=1200;
>
> data stratumsize (keep= _stratum _nsize_);
>   set rates;
>   _nsize_ = round(a_pct*b_pct*&total_sample_size);
>   length _stratum $3;
>   _stratum = put(age,2.)||sex;
> run;
>
> **--strata need to be in sort order--**;
> proc sort data=stratumsize;
>   by _stratum;
> run;
> proc sort data=from_class;
>   by _stratum;
> run;
>
> proc surveyselect
>   data=from_class
>   out=SampleStrata
>   method=srs
>   seed=1953
>   n=stratumsize
>   /*selectall*/;
>   strata _stratum;
> run;
>
> If I have misunderstood what you want, please clarify and I will try to
> correct the process.
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204

