Date: Thu, 22 Jan 2009 18:46:53 -0600
Reply-To: Joe Matise <snoopy369@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Joe Matise <snoopy369@GMAIL.COM>
Subject: Merging many-to-many, efficiently
Content-Type: text/plain; charset=ISO-8859-1
Hi Sas-L,
I have a dataset with question responses and a period value corresponding to
the month the respondent was surveyed. I want to produce a dataset with
multiple rows per respondent, one for each reporting period that respondent
qualifies for - month, fiscal quarter, fiscal year, etc. - so I can do a
proc means using period as a class variable. I have that information in a
second dataset.
This dataset will grow relatively large (50,000 respondents per month, up to
18 months reported on at any given time, so 900,000 potential respondents,
and up to 5 reporting periods they can qualify for each - up to 4.5m rows in
the resulting dataset). I'm considering how potentially other ways of
running the data might be more efficient, but for the moment let us say this
is how I will go.
What would be the most efficient way of merging my two datasets together (or
otherwise assigning period type)? I've thought of three ways so far, but my
tests even in those methods were inconclusive so far.
Method 1: A merge. I can't actually get this to behave as I want it to
(hence I can't really test it) and don't necessarily know that it will
work. Merge does not seem to want to put out the records more than once
each.
Method 2: A dual set statement, where I set the master dataset, then do n=1
by 1 until n=nobs using point=N and an if statement to limit the new dataset
to only those obs. where the period is the period of the current record.
~20 seconds for 50,000 respondents (one period only), so ~6 minutes for 900k
respondents, I imagine. I'm also experimenting using KEY= but that doesn't
seem to help (yet).
Method 3: A macro-based solution where I store a bunch of IF statements, one
for each period, that performs five outputs (as appropriate). Seems slow as
you'd have 18 if statements evaluated per row of data, but then again method
2 does process a much larger dataset.
Method 4: Proc SQL join (see below)
Sample data follows... I'm looking for most efficient solution that is not
overly complicated (one or two datasteps preferably). I'm avoiding PROC SQL
for the moment because I'm the only one in the office who can really
understand it (and so nobody else would follow my code if I used it), but if
that turns out to be better it can of course be used. I included method 2
below, as that's the one I've managed to work out so far that seems
practicable.
data surveydata;
drop i;
do periodtype=200901 to 200906;
do i = 1 to 50000;
q1=ranuni(5);
q2=ranuni(5);
q3=ranuni(5);
q4=ranuni(10);
output;
end;
end;
run;
data calendar;
input periodtype period $;
cards;
200901 0901
200901 09Q1
200901 09H1
200901 09Y
200902 0902
200902 09Q1
200902 09H1
200902 09Y
200903 0903
200903 09Q1
200903 09H1
200903 09Y
200904 0904
200904 09Q2
200904 09H1
200904 09Y
200905 0905
200905 09Q2
200905 09H1
200905 09Y
200906 0906
200906 09Q2
200906 09H1
200906 09Y
;;;;
run;
data testset;
set surveydata;
by periodtype;
N=1;
cperiodtype=periodtype;
do N=1 by 1 until (N=nobsvar);
set calendar point=N nobs=nobsvar;
if periodtype=cperiodtype then output;
end;
run;
That takes about 6 seconds for 1/3 of my potential rows, but I have ~150
questions per row, so it takes rather longer, of course. I might convert
the calendar dataset to horizontal, not vertical (so only one row per
periodtype, with multiple period variables), but I'm not sure that is any
more efficient (as I have a lot more statements to process then per row).
Thanks in advance!
-Joe