Date: Wed, 28 Jul 2010 11:57:34 -0500
Reply-To: Joe Matise <snoopy369@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Joe Matise <snoopy369@GMAIL.COM>
Subject: Re: Proc Surveyselect and PPS
In-Reply-To: <941871A13165C2418EC144ACB212BDB0018C01DF@dshsmxoly1504g.dshs.wa.lcl>
Content-Type: text/plain; charset=ISO-8859-1
Thanks, Dan. I am double-checking with the folks I'm writing this for, but
I think they need the sample to be complete - i.e., if I want 12000 records
and can only pull 11400 that fit the targets exactly, they still want 12000
records, matched to the targets as closely as possible. I think your method
could work by preprocessing the rates with a PROC FREQ of the dataset, and
adjusting them to fit the necessary criteria.
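A minimal sketch of that preprocessing step, assuming the from_class and
stratumsize datasets from Dan's code further down (both carrying _stratum,
with stratumsize already sorted by it). The names avail and
stratumsize_capped are made up for illustration, and this only caps each
requested stratum size at what is available; it does not redistribute the
shortfall to other strata:

```sas
/* Count the records actually available in each stratum. */
proc freq data=from_class noprint;
  tables _stratum / out=avail(keep=_stratum count);
run;

/* Cap each requested stratum size at the available count.       */
/* PROC FREQ writes avail in _stratum order, so the merge lines  */
/* up with the sorted stratumsize dataset.                       */
data stratumsize_capped(keep=_stratum _nsize_);
  merge stratumsize(in=want) avail(in=have);
  by _stratum;
  if want and have;
  _nsize_ = min(_nsize_, count);
run;

/* Total sample size that can be reached after capping. */
proc sql;
  select sum(_nsize_) as achievable_n from stratumsize_capped;
quit;
```

The capped dataset could then be passed to PROC SURVEYSELECT in place of
stratumsize, and achievable_n shows how far short of the requested total
the sample would land.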
I think the ultimate goal of the account group is to achieve a full sample
(12000, in this example) with a balanced representation of potential
respondents. I don't think the intent is to have a certain number of people
in each bucket so much as to have a 'balanced' representation of people
across the entire panel. Usually we are balancing to census numbers here
- so, census proportions of age/sex/region/household income/etc. - in order
to try to have a balanced panel.
The weighting method does come much closer to the requested size; for
example, with the example below adjusted to 12000 instead of 1200, your
method gets 11240 records while rimweighting gets 11742, with frequencies
much closer to the targets. I also have five criteria in the actual sample,
not two; I'm not sure what effect that would have on the effectiveness of
the strata (I'm pulling a large enough sample that all of the strata, even
with five criteria, still have a good number of records in them). I will
tailor your method to five criteria to see if it runs into any problems
(primarily adding three more tables to the join, if I understand correctly).
If I understand how PROC SURVEYSELECT works, it is doing roughly what I'm
doing - adjusting each record's probability of selection so that the right
mix of records is drawn. So there's not really a statistical difference
between the two approaches; the rimweighting just makes several passes
through the data, while SURVEYSELECT uses a single set of buckets.
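As a rough illustration of what "probability of selection" means here:
under PPS sampling without replacement, a record's expected selection
probability is sampsize * wgt / sum(wgt), and any record where that
exceeds 1 is what makes METHOD=PPS refuse the requested sample size. A
minimal sketch, assuming out_class (the raking output) still carries
recid, age and sex along with wgt; incl_prob, p_incl and n_over_one are
names made up for this example:

```sas
proc sql;
  /* Expected selection probability for each record at sampsize=12000. */
  /* sum(wgt) is remerged back onto every row by PROC SQL.             */
  create table incl_prob as
  select recid, age, sex, wgt,
         12000 * wgt / sum(wgt) as p_incl
  from out_class;

  /* How many records would need a probability above 1? */
  select count(*) as n_over_one
  from incl_prob
  where p_incl > 1;
quit;
```

If n_over_one is zero the requested size is feasible as it stands;
otherwise those are the records that would have to be taken with
certainty or have their weights trimmed.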
Interestingly enough, the two sets of frequencies come out quite different -
my method falls short on females and males about equally, and on 14-year-olds
only, while your method falls short on 13- and 14-year-olds, and entirely on
males:
My method:
age  Frequency  Percent  Cum. Frequency  Cum. Percent
11        1199    10.21            1199         10.21
12        1200    10.22            2399         20.43
13        2401    20.45            4800         40.88
14        3341    28.45            8141         69.33
15        2400    20.44           10541         89.77
16        1201    10.23           11742        100.00
sex  Frequency  Percent  Cum. Frequency  Cum. Percent
F         7037    59.93            7037         59.93
M         4705    40.07           11742        100.00
Your method:
age  Frequency  Percent  Cum. Frequency  Cum. Percent
11        1200    10.68            1200         10.68
12        1200    10.68            2400         21.35
13        1962    17.46            4362         38.81
14        3278    29.16            7640         67.97
15        2400    21.35           10040         89.32
16        1200    10.68           11240        100.00
sex  Frequency  Percent  Cum. Frequency  Cum. Percent
F         7200    64.06            7200         64.06
M         4040    35.94           11240        100.00
When I tested it, I found the rimweighting gets a bit closer if I ask it to
be more precise (so the gap is not solely due to the inadequate sample
size). Even so, it definitely gets closer to the desired results - 60% F,
40% M, etc. - despite coming up short on total records.
Thanks,
Joe
On Wed, Jul 28, 2010 at 3:52 AM, Nordlund, Dan (DSHS/RDA) <
NordlDJ@dshs.wa.gov> wrote:
> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> > Joe Matise
> > Sent: Tuesday, July 27, 2010 4:16 PM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Proc Surveyselect and PPS
> >
> > Hi folks,
> > I'm helping someone in my company develop a sampling program that will
> > select balanced samples according to various categories - i.e., rather
> > than have specific strata cell sizes, we want to set up overall
> > proportion targets for each stratification variable, but not specify
> > the cell sizes directly. For example, we might be asked that a sample
> > be 55% female, 45% male, with certain percentages in each age group,
> > certain percentages from each census region, and certain percentages
> > from each income category.
> >
> > I thought it would be most efficient to first rimweight the sample, and
> > then select a sample based on SIZE=(rimweight) from that dataset, using
> > PPS. Sampling without replacement is necessary, I believe, as we don't
> > want to select people twice.
> >
> > However, I haven't been able to convince PROC SURVEYSELECT to use
> > regular PPS. I obviously misunderstand something about CERTSIZE= or
> > MAXSIZE=, or am just missing the boat entirely. What I currently do is
> > something like this:
> >
> > data from_class;
> >   set sashelp.class;
> >   if _n_ = 1 then _r=1;
> >   x = floor(ranuni(7)*1500);
> >   do _t = 1 to x;
> >     recid=_r;
> >     _r+1;
> >     output;
> >   end;
> >   drop _:;
> > run;
> >
> > data f_age;
> > input age percent;
> > datalines;
> > 11 10
> > 12 10
> > 13 20
> > 14 30
> > 15 20
> > 16 10
> > ;;;;
> > run;
> > data f_sex;
> > input sex $ percent;
> > datalines;
> > M 40
> > F 60
> > ;;;;
> > run;
> >
> >
> >
> > options nonotes nomprint nosymbolgen nomlogic;
> > %rakinge( /* call enhanced raking */
> > inds=from_class,
> > outds=out_class,
> > inwt= ,
> > freqlist=F_age
> > f_sex,
> > outwt=wgt,
> > byvar=,
> > varlist = age sex,
> > numvar=2,
> > cntotal=18701,
> > trmprec=1,
> > trmpct=,/*0.0001, macro will terminate based on this criterion */
> > numiter=250,
> > prdiag=
> > );
> > options notes;
> >
> > proc surveyselect data=out_class sampsize=12000 method=pps_seq
> > out=selected_pass1;
> > *proc surveyselect data=out_class sampsize=12000 method=pps_seq
> > out=selected_pass1 certsize=1.55;
> > *this does not work, nor does using maxsize=1.55 - which is (total
> > size)/(samp size) or (18701/12000).;
> > size wgt;
> > run;
> > proc freq data=selected_pass1;
> > tables age sex;
> > run;
> >
> > (Using the rakinge macro originally created by David Izrael and found
> > in posts on SAS-L).
> >
> > Now, with an appropriate sampsize, say, 90, it works fine - there are
> > enough of everything to get 90, no problem. However, with a sampsize
> > that is higher, say 120, it refuses - presumably because it cannot find
> > enough to get that 120 and still hold to the requirements. Some records
> > would have a sampling probability greater than 1, and that causes it to
> > fail. Using CERTSIZE or MAXSIZE doesn't seem to help (setting it to
> > (total count)/(sample desired) or even smaller numbers still fails).
> >
> > If I use PPS_SEQ or PPS_SYS, it works, but underdraws the sample (as
> > it's using replacement). One option I suppose would be to do this and
> > then repull additional records to make up the difference, though I'm
> > not entirely sure of the best way to do this while still staying as
> > close as possible to the weighted targets. I could just rerun PROC
> > SURVEYSELECT, but that carries some risk - both that I might end up
> > doing this five or six times, and that the resulting proportions may
> > not match the targets particularly well.
> >
> > For example:
> >
> >
> > proc sort data=selected_pass1 out=selected_merge(keep=recid);
> > by recid;
> > run;
> > proc sort data=out_class;
> > by recid;
> > run;
> >
> >
> > data for_pass2;
> > merge out_class(in=a) selected_merge(in=b);
> > by recid;
> > if not b;
> > run;
> >
> > proc sql;
> > select 12000 - count(1) into :topull from selected_pass1;
> > quit;
> >
> > proc surveyselect data=for_pass2 out=selected_pass2 sampsize=&topull
> > method=pps;
> > size wgt;
> > run;
> >
> > data selected_all;
> > set selected_pass1 selected_pass2;
> > run;
> > proc freq data=selected_all;
> > tables age sex;
> > run;
> >
> > That gets me only 27% 14 year olds. I do have enough to get close at
> > least - I have 3431 14 year olds, and only 3234 are being pulled - so
> > clearly the weighting at this point just fails, because it's no longer
> > particularly accurate. I can reweight, but then I'm very likely to have
> > to do this repeatedly, as I will again have sample sizes that are not
> > appropriate.
> >
> > Is there a better solution than rimweighting and then using PPS of some
> > variety? Am I missing something on the correct use of MAXSIZE and/or
> > CERTSIZE?
> >
> > Thanks,
> >
> > Joe
>
> Joe,
>
> I think what you want is actually stratified sampling (method=srs with the
> strata statement). I don't have the time to figure out the raking macro, so
> I modified your data generation slightly by adding data for age=16, sex=F.
> I used your two datasets specifying age and sex percentages to create the
> rate for sampling each stratum, and then calculated stratum size using the
> rates and the total desired sample size. Then created a stratumsize dataset
> to use with surveyselect. I presume the raking macro will either calculate
> stratum size or sampling rate which can be substituted appropriately in my
> process. (I only created the _stratum variable because I couldn't get
> surveyselect to work with age and sex directly and I am too tired to figure
> it out tonight.) If a calculated stratum size exceeds the number of
> available cases, you can tell surveyselect to take all the obs in that
> stratum by uncommenting the selectall option below. So, something like
> this should work.
>
> data from_class;
>   set sashelp.class;
>   length _stratum $3;
>   _stratum = put(age,2.)||sex;
>
>   if _n_ = 1 then _r=1;
>   x = floor(ranuni(7)*1500);
>   do _t = 1 to x;
>     recid=_r;
>     _r+1;
>     output;
>   end;
>   **--adding records for age=16, sex=F--**;
>   if name EQ 'Barbara' then do _t = 1 to x;
>     name = 'Babs';
>     age=16;
>     _stratum = put(age,2.)||sex;
>     recid=_r;
>     _r+1;
>     output;
>   end;
>   drop _r _t;
> run;
>
> data f_age;
> input age percent;
> datalines;
> 11 10
> 12 10
> 13 20
> 14 30
> 15 20
> 16 10
> ;;;;
> run;
> data f_sex;
> input sex $ percent;
> datalines;
> M 40
> F 60
> ;;;;
> run;
>
> **--determine wanted stratum sampling rates--**;
> proc sql;
> create table rates as
> select a.age, a.percent/100 as a_pct, b.sex, b.percent/100 as b_pct
> from f_age as a, f_sex as b
> ;
> quit;
>
> **--create secondary dataset with sample stratum sizes--**;
> %let total_sample_size=1200;
>
> data stratumsize (keep= _stratum _nsize_);
> set rates;
> _nsize_ = round(a_pct*b_pct*&total_sample_size);
> length _stratum $3;
> _stratum = put(age,2.)||sex;
> run;
>
> **--strata need to be in sort order--**;
> proc sort data=stratumsize;
> by _stratum;
> run;
> proc sort data=from_class;
> by _stratum;
> run;
>
> proc surveyselect
> data=from_class
> out=SampleStrata
> method=srs
> seed=1953
> n=stratumsize
> /*selectall*/;
> strata _stratum;
> run;
>
> If I have misunderstood what you want, please clarify and I will try to
> correct the process.
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204
>