Date: Sat, 24 Jan 2009 18:14:52 -0500
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>"
<schreier.junk.mail@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>"
<schreier.junk.mail@GMAIL.COM>
Subject: Re: Merging many-to-many, efficiently
On Thu, 22 Jan 2009 18:46:53 -0600, Joe Matise <snoopy369@GMAIL.COM> wrote:
>Hi Sas-L,
>I have a dataset with question responses and a period value corresponding to
>the month the respondent was surveyed. I want to produce a dataset with
>multiple rows per respondent, one for each reporting period that respondent
>qualifies for - month, fiscal quarter, fiscal year, etc. - so I can do a
>proc means using period as a class variable. I have that information in a
>second dataset.
>
>This dataset will grow relatively large (50,000 respondents per month, up to
>18 months reported on at any given time, so 900,000 potential respondents,
>and up to 5 reporting periods they can qualify for each - up to 4.5m rows in
>the resulting dataset). I'm considering how potentially other ways of
>running the data might be more efficient, but for the moment let us say this
>is how I will go.
>
>What would be the most efficient way of merging my two datasets together (or
>otherwise assigning period type)? I've thought of three ways so far, but my
>tests even in those methods were inconclusive so far.
>
>Method 1: A merge. I can't actually get this to behave as I want it to
>(hence I can't really test it) and don't necessarily know that it will
>work. Merge does not seem to want to put out the records more than once
>each.
>
>Method 2: A dual set statement, where I set the master dataset, then do n=1
>by 1 until n=nobs using point=N and an if statement to limit the new dataset
>to only those obs. where the period is the period of the current record.
>~20 seconds for 50,000 respondents (one period only), so ~6 minutes for 900k
>respondents, I imagine. I'm also experimenting using KEY= but that doesn't
>seem to help (yet).
>
>Method 3: A macro-based solution where I store a bunch of IF statements, one
>for each period, that performs five outputs (as appropriate). Seems slow as
>you'd have 18 if statements evaluated per row of data, but then again method
>2 does process a much larger dataset.
>
>Method 4: Proc SQL join (see below)
>
>Sample data follows... I'm looking for most efficient solution that is not
>overly complicated (one or two datasteps preferably). I'm avoiding PROC SQL
>for the moment because I'm the only one in the office who can really
>understand it (and so nobody else would follow my code if I used it), but if
>that turns out to be better it can of course be used. I included method 2
>below, as that's the one I've managed to work out so far that seems
>practicable.
>
>data surveydata;
>drop i;
>do periodtype=200901 to 200906;
>do i = 1 to 50000;
> q1=ranuni(5);
> q2=ranuni(5);
> q3=ranuni(5);
> q4=ranuni(10);
> output;
>end;
>end;
>run;
>
>data calendar;
>input periodtype period $;
>cards;
>200901 0901
>200901 09Q1
>200901 09H1
>200901 09Y
>200902 0902
>200902 09Q1
>200902 09H1
>200902 09Y
>200903 0903
>200903 09Q1
>200903 09H1
>200903 09Y
>200904 0904
>200904 09Q2
>200904 09H1
>200904 09Y
>200905 0905
>200905 09Q2
>200905 09H1
>200905 09Y
>200906 0906
>200906 09Q2
>200906 09H1
>200906 09Y
>;;;;
>run;
>
>data testset;
>set surveydata;
>by periodtype;
>N=1;
>cperiodtype=periodtype;
>do N=1 by 1 until (N=nobsvar);
> set calendar point=N nobs=nobsvar;
> if periodtype=cperiodtype then output;
>end;
>run;
>
>
>That takes about 6 seconds for 1/3 of my potential rows, but I have ~150
>questions per row, so it takes rather longer, of course. I might convert
>the calendar dataset to horizontal, not vertical (so only one row per
>periodtype, with multiple period variables), but I'm not sure that is any
>more efficient (as I have a lot more statements to process then per row).
>
>Thanks in advance!
>
>-Joe
Here's a way to get what (I think) Joe wants without spawning more
observations and without building the CALENDAR table or a format.
First create CLASS variables for each of the calendar intervals:
data formeans / view=formeans;
set surveydata;
Month = input(put(periodtype,6.),yymmN6.);
Quarter = month;
Year = month;
length Half $ 6;
Half = cats( year(month) , 'H' , ceil(month(month)/6) );
format month monyy7. quarter yyq6. year year.;
run;
Notice that HALF is a character variable while the other three are formatted
SAS dates. It doesn't matter for the following MEANS step, but if there is
processing to be done downstream it may be necessary to do a little more work.
For now, run PROC MEANS:
proc means data=formeans mean;
class year half quarter month;
ways 1;
var q1 q2 q3 q4;
run;
Output:
The MEANS Procedure
Month N Obs Variable Mean
JAN09 50000 q1 0.4988148
q2 0.5020346
q3 0.4997779
q4 0.5008723
FEB09 50000 q1 0.5018138
q2 0.5008249
q3 0.5020676
q4 0.4996224
MAR09 50000 q1 0.4996192
q2 0.5002024
q3 0.4990159
q4 0.4986696
APR09 50000 q1 0.4986109
q2 0.5006980
q3 0.5015553
q4 0.5028856
MAY09 50000 q1 0.5022716
q2 0.4969104
q3 0.4997427
q4 0.4986454
JUN09 50000 q1 0.5011484
q2 0.5009835
q3 0.5021118
q4 0.4977901
Quarter N Obs Variable Mean
2009Q1 150000 q1 0.5000826
q2 0.5010206
q3 0.5002871
q4 0.4997214
2009Q2 150000 q1 0.5006770
q2 0.4995306
q3 0.5011366
q4 0.4997737
Half N Obs Variable Mean
2009H1 300000 q1 0.5003798
q2 0.5002756
q3 0.5007119
q4 0.4997476
Year N Obs Variable Mean
2009 300000 q1 0.5003798
q2 0.5002756
q3 0.5007119
q4 0.4997476