Date: Fri, 23 Jan 2009 10:42:53 -0400
Reply-To: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject: Re: Merging many-to-many, efficiently
In-Reply-To: <b7a7fa630901221646ufb69f2dv1b5c34f7ee0effed@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Joe,
Merging can be done in another way using KEY_INDEXING. Here are two
solutions. The first uses your datasets. A 2-dimensioned array,
K[999999,0:3] of $4 is used to capture PERIOD for each periodtype of
CALENDAR, which serves as LOOKUP dataset. The biggest (likely to be)
dataset, SURVEYDATA, is used as a DRIVER.. This array dimension (999999 )
can be reduced to conserve memory space to suit your situation. The code
that follows can be tested for speed along with other solutions.
data need;
do _n_ = 1 by 1 until(eof);
set calendar end = eof;
array k[999999,0:3] $4 _temporary_;
k[periodtype,mod(_n_,4)] = period;
end;
do until(eof1);
set surveydata end = eof1;
do i = 0 to 3;
if k[periodtype,i] ne ' ' then do;
period = k[periodtype,i];
output;
end;
end;
end;
run;
The second solution is based on the fact that the dataset, CALENDAR, uses a
set of 4 PERIODs. To me this set can be easily derivable from PERIODTYPE
using in/formats of SAS and user written codes. Once we make the merging of
CALENDAR using distinct value of PERIODTYPE with that of SURVEYDATA , then
each row of the derived dataset can be made to generate 4 rows to represent
year-month, year-quarter, year-semiyear, year-Y labels. For example, YYMN4.
of PERIODTYPE gives year-month Label. Similarly, YYQ4. gives year-quarter
label. The other 2 labels can be created by user-code ( I could not easily
locate equivalent formats). This logic reduces the size of the dataset to
one-fourth of what you have been thinking of.
Here is the modified code that uses the distinct value. The NEED dataset can
be used to generate other 4 labels.
data need;
do until(eof);
set calendar(keep = periodtype) end = eof;
array k[999999] _temporary_;
k[periodtype] = 1;
end;
do until(eof1);
set surveydata end = eof1;
if k[periodtype] then output;
end;
run;
Kind regards,
Muthia Kachirayan
On Thu, Jan 22, 2009 at 8:46 PM, Joe Matise <snoopy369@gmail.com> wrote:
> Hi Sas-L,
> I have a dataset with question responses and a period value corresponding
> to
> the month the respondent was surveyed. I want to produce a dataset with
> multiple rows per respondent, one for each reporting period that respondent
> qualifies for - month, fiscal quarter, fiscal year, etc. - so I can do a
> proc means using period as a class variable. I have that information in a
> second dataset.
>
> This dataset will grow relatively large (50,000 respondents per month, up
> to
> 18 months reported on at any given time, so 900,000 potential respondents,
> and up to 5 reporting periods they can qualify for each - up to 4.5m rows
> in
> the resulting dataset). I'm considering how potentially other ways of
> running the data might be more efficient, but for the moment let us say
> this
> is how I will go.
>
> What would be the most efficient way of merging my two datasets together
> (or
> otherwise assigning period type)? I've thought of three ways so far, but
> my
> tests even in those methods were inconclusive so far.
>
> Method 1: A merge. I can't actually get this to behave as I want it to
> (hence I can't really test it) and don't necessarily know that it will
> work. Merge does not seem to want to put out the records more than once
> each.
>
> Method 2: A dual set statement, where I set the master dataset, then do n=1
> by 1 until n=nobs using point=N and an if statement to limit the new
> dataset
> to only those obs. where the period is the period of the current record.
> ~20 seconds for 50,000 respondents (one period only), so ~6 minutes for
> 900k
> respondents, I imagine. I'm also experimenting using KEY= but that doesn't
> seem to help (yet).
>
> Method 3: A macro-based solution where I store a bunch of IF statements,
> one
> for each period, that performs five outputs (as appropriate). Seems slow
> as
> you'd have 18 if statements evaluated per row of data, but then again
> method
> 2 does process a much larger dataset.
>
> Method 4: Proc SQL join (see below)
>
> Sample data follows... I'm looking for most efficient solution that is not
> overly complicated (one or two datasteps preferably). I'm avoiding PROC
> SQL
> for the moment because I'm the only one in the office who can really
> understand it (and so nobody else would follow my code if I used it), but
> if
> that turns out to be better it can of course be used. I included method 2
> below, as that's the one I've managed to work out so far that seems
> practicable.
>
> data surveydata;
> drop i;
> do periodtype=200901 to 200906;
> do i = 1 to 50000;
> q1=ranuni(5);
> q2=ranuni(5);
> q3=ranuni(5);
> q4=ranuni(10);
> output;
> end;
> end;
> run;
>
> data calendar;
> input periodtype period $;
> cards;
> 200901 0901
> 200901 09Q1
> 200901 09H1
> 200901 09Y
> 200902 0902
> 200902 09Q1
> 200902 09H1
> 200902 09Y
> 200903 0903
> 200903 09Q1
> 200903 09H1
> 200903 09Y
> 200904 0904
> 200904 09Q2
> 200904 09H1
> 200904 09Y
> 200905 0905
> 200905 09Q2
> 200905 09H1
> 200905 09Y
> 200906 0906
> 200906 09Q2
> 200906 09H1
> 200906 09Y
> ;;;;
> run;
>
> data testset;
> set surveydata;
> by periodtype;
> N=1;
> cperiodtype=periodtype;
> do N=1 by 1 until (N=nobsvar);
> set calendar point=N nobs=nobsvar;
> if periodtype=cperiodtype then output;
> end;
> run;
>
>
> That takes about 6 seconds for 1/3 of my potential rows, but I have ~150
> questions per row, so it takes rather longer, of course. I might convert
> the calendar dataset to horizontal, not vertical (so only one row per
> periodtype, with multiple period variables), but I'm not sure that is any
> more efficient (as I have a lot more statements to process then per row).
>
> Thanks in advance!
>
> -Joe
>
|