```Date: Fri, 23 Jan 2009 10:42:53 -0400 Reply-To: Muthia Kachirayan Sender: "SAS(r) Discussion" From: Muthia Kachirayan Subject: Re: Merging many-to-many, efficiently In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Joe, Merging can be done in another way using KEY_INDEXING. Here are two solutions. The first uses your datasets. A 2-dimensioned array, K[999999,0:3] of \$4 is used to capture PERIOD for each periodtype of CALENDAR, which serves as LOOKUP dataset. The biggest (likely to be) dataset, SURVEYDATA, is used as a DRIVER.. This array dimension (999999 ) can be reduced to conserve memory space to suit your situation. The code that follows can be tested for speed along with other solutions. data need; do _n_ = 1 by 1 until(eof); set calendar end = eof; array k[999999,0:3] \$4 _temporary_; k[periodtype,mod(_n_,4)] = period; end; do until(eof1); set surveydata end = eof1; do i = 0 to 3; if k[periodtype,i] ne ' ' then do; period = k[periodtype,i]; output; end; end; end; run; The second solution is based on the fact that the dataset, CALENDAR, uses a set of 4 PERIODs. To me this set can be easily derivable from PERIODTYPE using in/formats of SAS and user written codes. Once we make the merging of CALENDAR using distinct value of PERIODTYPE with that of SURVEYDATA , then each row of the derived dataset can be made to generate 4 rows to represent year-month, year-quarter, year-semiyear, year-Y labels. For example, YYMN4. of PERIODTYPE gives year-month Label. Similarly, YYQ4. gives year-quarter label. The other 2 labels can be created by user-code ( I could not easily locate equivalent formats). This logic reduces the size of the dataset to one-fourth of what you have been thinking of. Here is the modified code that uses the distinct value. The NEED dataset can be used to generate other 4 labels. data need; do until(eof); set calendar(keep = periodtype) end = eof; array k[999999] _temporary_; k[periodtype] = 1; end; do until(eof1); set surveydata end = eof1; if k[periodtype] then output; end; run; Kind regards, Muthia Kachirayan On Thu, Jan 22, 2009 at 8:46 PM, Joe Matise wrote: > Hi Sas-L, > I have a dataset with question responses and a period value corresponding > to > the month the respondent was surveyed. I want to produce a dataset with > multiple rows per respondent, one for each reporting period that respondent > qualifies for - month, fiscal quarter, fiscal year, etc. - so I can do a > proc means using period as a class variable. I have that information in a > second dataset. > > This dataset will grow relatively large (50,000 respondents per month, up > to > 18 months reported on at any given time, so 900,000 potential respondents, > and up to 5 reporting periods they can qualify for each - up to 4.5m rows > in > the resulting dataset). I'm considering how potentially other ways of > running the data might be more efficient, but for the moment let us say > this > is how I will go. > > What would be the most efficient way of merging my two datasets together > (or > otherwise assigning period type)? I've thought of three ways so far, but > my > tests even in those methods were inconclusive so far. > > Method 1: A merge. I can't actually get this to behave as I want it to > (hence I can't really test it) and don't necessarily know that it will > work. Merge does not seem to want to put out the records more than once > each. > > Method 2: A dual set statement, where I set the master dataset, then do n=1 > by 1 until n=nobs using point=N and an if statement to limit the new > dataset > to only those obs. where the period is the period of the current record. > ~20 seconds for 50,000 respondents (one period only), so ~6 minutes for > 900k > respondents, I imagine. I'm also experimenting using KEY= but that doesn't > seem to help (yet). > > Method 3: A macro-based solution where I store a bunch of IF statements, > one > for each period, that performs five outputs (as appropriate). Seems slow > as > you'd have 18 if statements evaluated per row of data, but then again > method > 2 does process a much larger dataset. > > Method 4: Proc SQL join (see below) > > Sample data follows... I'm looking for most efficient solution that is not > overly complicated (one or two datasteps preferably). I'm avoiding PROC > SQL > for the moment because I'm the only one in the office who can really > understand it (and so nobody else would follow my code if I used it), but > if > that turns out to be better it can of course be used. I included method 2 > below, as that's the one I've managed to work out so far that seems > practicable. > > data surveydata; > drop i; > do periodtype=200901 to 200906; > do i = 1 to 50000; > q1=ranuni(5); > q2=ranuni(5); > q3=ranuni(5); > q4=ranuni(10); > output; > end; > end; > run; > > data calendar; > input periodtype period \$; > cards; > 200901 0901 > 200901 09Q1 > 200901 09H1 > 200901 09Y > 200902 0902 > 200902 09Q1 > 200902 09H1 > 200902 09Y > 200903 0903 > 200903 09Q1 > 200903 09H1 > 200903 09Y > 200904 0904 > 200904 09Q2 > 200904 09H1 > 200904 09Y > 200905 0905 > 200905 09Q2 > 200905 09H1 > 200905 09Y > 200906 0906 > 200906 09Q2 > 200906 09H1 > 200906 09Y > ;;;; > run; > > data testset; > set surveydata; > by periodtype; > N=1; > cperiodtype=periodtype; > do N=1 by 1 until (N=nobsvar); > set calendar point=N nobs=nobsvar; > if periodtype=cperiodtype then output; > end; > run; > > > That takes about 6 seconds for 1/3 of my potential rows, but I have ~150 > questions per row, so it takes rather longer, of course. I might convert > the calendar dataset to horizontal, not vertical (so only one row per > periodtype, with multiple period variables), but I'm not sure that is any > more efficient (as I have a lot more statements to process then per row). > > Thanks in advance! > > -Joe > ```

Back to: Top of message | Previous page | Main SAS-L page