Date: Thu, 17 Nov 2005 17:25:06 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: PROC SQL Help
In-Reply-To: <9871F0C990DF9B4681F1EB312A4D02F80A263C0C@ucmail8.ad.uc.edu>
Content-Type: text/plain; format=flowed
BREWERDI@UCMAIL.UC.EDU wrote back to me personally instead of to SAS-L:
>David,
>
>Thanks for your kind suggestions.
>
>The one-to-one matching must be done by matching on begin_month, gender,
>racegroup, and agegroup. If it were only begin_month, I wouldn't need
>any elaborate coding scheme and I could use what you suggested.
>
>So I might have say 300 control records with month=7, but I need to
>match the other three criteria as well; if I do a merge, I will get a
>many-to-many or 1-to-many match, which would be erroneous. I don't
>understand SQL and am learning by the seat of my pants. I tried the
>merge and it was not working...too many matches, so I thought I would
>look into the sql route.
>
>For example, I might have 8 case records that match on gender, agegroup,
>and racrgroup; those eight records have one record each for
>begin_month=7,8,9, and 12 and four records for month=11. That is what I
>need to select from the control population (only 8 records, 1-to-1, per
>the PI...I am just a grunt following orders =:) ).
>
>If you have any additional suggestions, they would be greatly
>appreciated.
>
>Thanks again for your time and expertise.
Okay, now this is *exactly* why you want to muck about with PROC
SURVEYSELECT.
You're setting up properly to get your control for your cases. You say that
you
have exactly one case record for each of 8 different combinations of
begin_month,
gender, racegroup, and agegroup. If you don't feel comfortable doing this
in PROC
SQL, then do it in a data step (which just takes more steps) since you'll
have a 1:many
merge. If you were looking at a n:m merge, you couldn't attempt that as a
straight
data step merge. I'll show you the data step code. Sig or someone else
will no doubt
chime in with a single PROC SQL step to get the output. But since you say
you don't
grok SQL, I'll skip that for now.
proc sort data=cases out=cases2(keep=begin_month gender racegroup agegroup);
by begin_month gender racegroup agegroup;
run;
proc sort data=controls;
by begin_month gender racegroup agegroup;
run;
data combo;
merge cases2(in=incases) controls;
by begin_month gender racegroup agegroup;
if incases;
run;
Now all the above could be done in a single PROC SQL step. With n=4000,
you probably won't notice much of a difference in time needed, either way.
With a much larger data set, the PROC SL approach would transparently let
you toss the problem to the SL optmizer, which would probably use a hash
to pick out the relevant records before needing to do the sort at the end.
That would be faster.
But now that the data are trimmed down to the relevant 8 cases by our merge,
we can use PROC SURVEYSELECT to pick K controls (here we'll use 1) per
cell. We'll cheat and call these cells our strata, and do what is really a
stratified
sample.
proc surveyselect data=combo out=YourSample sampsize=1 seed=947374;
strata begin_month gender racegroup agegroup;
run;
That's it. It's fairly straightforward to perform case-control sampling
using this
proc, but it's not a well-known technique. Normally, people do it using
hideously
complex macros.
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
_________________________________________________________________
Is your PC infected? Get a FREE online computer virus scan from McAfeeŽ
Security. http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963
|