LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2009, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 23 Jan 2009 10:42:53 -0400
Reply-To:     Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject:      Re: Merging many-to-many, efficiently
In-Reply-To:  <b7a7fa630901221646ufb69f2dv1b5c34f7ee0effed@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Joe,

Merging can be done in another way using KEY_INDEXING. Here are two solutions. The first uses your datasets. A 2-dimensioned array, K[999999,0:3] of $4 is used to capture PERIOD for each periodtype of CALENDAR, which serves as LOOKUP dataset. The biggest (likely to be) dataset, SURVEYDATA, is used as a DRIVER.. This array dimension (999999 ) can be reduced to conserve memory space to suit your situation. The code that follows can be tested for speed along with other solutions.

data need; do _n_ = 1 by 1 until(eof); set calendar end = eof; array k[999999,0:3] $4 _temporary_;

k[periodtype,mod(_n_,4)] = period; end; do until(eof1); set surveydata end = eof1; do i = 0 to 3; if k[periodtype,i] ne ' ' then do; period = k[periodtype,i]; output; end; end; end; run;

The second solution is based on the fact that the dataset, CALENDAR, uses a set of 4 PERIODs. To me this set can be easily derivable from PERIODTYPE using in/formats of SAS and user written codes. Once we make the merging of CALENDAR using distinct value of PERIODTYPE with that of SURVEYDATA , then each row of the derived dataset can be made to generate 4 rows to represent year-month, year-quarter, year-semiyear, year-Y labels. For example, YYMN4. of PERIODTYPE gives year-month Label. Similarly, YYQ4. gives year-quarter label. The other 2 labels can be created by user-code ( I could not easily locate equivalent formats). This logic reduces the size of the dataset to one-fourth of what you have been thinking of.

Here is the modified code that uses the distinct value. The NEED dataset can be used to generate other 4 labels.

data need; do until(eof); set calendar(keep = periodtype) end = eof; array k[999999] _temporary_; k[periodtype] = 1; end; do until(eof1); set surveydata end = eof1; if k[periodtype] then output; end; run;

Kind regards,

Muthia Kachirayan

On Thu, Jan 22, 2009 at 8:46 PM, Joe Matise <snoopy369@gmail.com> wrote:

> Hi Sas-L, > I have a dataset with question responses and a period value corresponding > to > the month the respondent was surveyed. I want to produce a dataset with > multiple rows per respondent, one for each reporting period that respondent > qualifies for - month, fiscal quarter, fiscal year, etc. - so I can do a > proc means using period as a class variable. I have that information in a > second dataset. > > This dataset will grow relatively large (50,000 respondents per month, up > to > 18 months reported on at any given time, so 900,000 potential respondents, > and up to 5 reporting periods they can qualify for each - up to 4.5m rows > in > the resulting dataset). I'm considering how potentially other ways of > running the data might be more efficient, but for the moment let us say > this > is how I will go. > > What would be the most efficient way of merging my two datasets together > (or > otherwise assigning period type)? I've thought of three ways so far, but > my > tests even in those methods were inconclusive so far. > > Method 1: A merge. I can't actually get this to behave as I want it to > (hence I can't really test it) and don't necessarily know that it will > work. Merge does not seem to want to put out the records more than once > each. > > Method 2: A dual set statement, where I set the master dataset, then do n=1 > by 1 until n=nobs using point=N and an if statement to limit the new > dataset > to only those obs. where the period is the period of the current record. > ~20 seconds for 50,000 respondents (one period only), so ~6 minutes for > 900k > respondents, I imagine. I'm also experimenting using KEY= but that doesn't > seem to help (yet). > > Method 3: A macro-based solution where I store a bunch of IF statements, > one > for each period, that performs five outputs (as appropriate). Seems slow > as > you'd have 18 if statements evaluated per row of data, but then again > method > 2 does process a much larger dataset. > > Method 4: Proc SQL join (see below) > > Sample data follows... I'm looking for most efficient solution that is not > overly complicated (one or two datasteps preferably). I'm avoiding PROC > SQL > for the moment because I'm the only one in the office who can really > understand it (and so nobody else would follow my code if I used it), but > if > that turns out to be better it can of course be used. I included method 2 > below, as that's the one I've managed to work out so far that seems > practicable. > > data surveydata; > drop i; > do periodtype=200901 to 200906; > do i = 1 to 50000; > q1=ranuni(5); > q2=ranuni(5); > q3=ranuni(5); > q4=ranuni(10); > output; > end; > end; > run; > > data calendar; > input periodtype period $; > cards; > 200901 0901 > 200901 09Q1 > 200901 09H1 > 200901 09Y > 200902 0902 > 200902 09Q1 > 200902 09H1 > 200902 09Y > 200903 0903 > 200903 09Q1 > 200903 09H1 > 200903 09Y > 200904 0904 > 200904 09Q2 > 200904 09H1 > 200904 09Y > 200905 0905 > 200905 09Q2 > 200905 09H1 > 200905 09Y > 200906 0906 > 200906 09Q2 > 200906 09H1 > 200906 09Y > ;;;; > run; > > data testset; > set surveydata; > by periodtype; > N=1; > cperiodtype=periodtype; > do N=1 by 1 until (N=nobsvar); > set calendar point=N nobs=nobsvar; > if periodtype=cperiodtype then output; > end; > run; > > > That takes about 6 seconds for 1/3 of my potential rows, but I have ~150 > questions per row, so it takes rather longer, of course. I might convert > the calendar dataset to horizontal, not vertical (so only one row per > periodtype, with multiple period variables), but I'm not sure that is any > more efficient (as I have a lot more statements to process then per row). > > Thanks in advance! > > -Joe >


Back to: Top of message | Previous page | Main SAS-L page