Date: Wed, 4 Jun 2008 22:13:20 -0400
Reply-To: msz03@albany.edu
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mike Zdeb <msz03@ALBANY.EDU>
Subject: Re: Question about efficient data extraction
Content-Type: text/plain;charset=iso-8859-1
hi ... I've read all the posts ... I modified the loops in the original job to make the data sets
a bit larger ... 10,000 in COHORT that reduces to 9,944 after the NODUPKEY sort ... 30,005,555 in
CLAIMS (after changing the loop limit to 10000000) ... using V9.1.3 on a relatively fast PC
running XP ...
all methods produce 29,608 hits
#1: MERGE
real time 6.29 seconds
cpu time 6.28 seconds
but you don't have to sort CLAIMS since it's generated in sorted order --- to be fair, the time
for a sort of an unsorted CLAIMS data set would be a more realistic test
#2: HASH
real time 8.21 seconds
cpu time 8.21 seconds
no sorts or addition of indexes are ever needed so no additional time for other tasks
#3: TABLE LOOKUP USING A FORMAT
real time 18.23 seconds
cpu time 18.23 seconds
no sorts or addition of indexes are ever needed so no additional time for other tasks
#4: KEY INDEXING WITH A FORMAT
real time 2.81 seconds
cpu time 2.81 seconds
no sorts or addition of indexes are ever needed so no additional time for other tasks
#5: SQL WITH AN INDEXED DATA SET (STILL THE FASTEST AS CITED BELOW)
real time 0.15 seconds
cpu time 0.15 seconds
but I added the INDEX with PROC DATASETS and that also took time --- so what's the real time, 0.15
or 35 seconds ...
NOTE: MODIFY was successful for WORK.CLAIMS.DATA.
NOTE: PROCEDURE DATASETS used (Total process time):
real time 33.57 seconds
cpu time 23.00 seconds
I'm not sure if one of the recommended papers you read was ... "Getting Started with the DATA Step
Hash Object" ...
http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf
if not, it ends with a nice discussion of the various techniques ... and as Muthia pointed out,
Paul Dorfman's papers on KEY INDEXING and HASHING are wonderful
--
Mike Zdeb
U@Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475
> Thanks to Scott, others, and recommended papers, I have found a data step approach that is as
> speedy as the SQL that I posted earlier. In an earlier response to Scott's post above, I stated
> that it produced only one record per ID. However a slight modification gets all records.
>
> data test8;
> set cohort;
> do until(_iorc_ NE 0);
> set claims key=id;
> if _iorc_ EQ 0 then output;
> end;
> _error_ = 0;
> run;
>
> NOTE: There were 1000 observations read from the data set WORK.COHORT.
> NOTE: The data set WORK.TEST8 has 2943 observations and 3 variables.
> NOTE: DATA statement used (Total process time):
> real time 0.03 seconds
> cpu time 0.03 seconds
>
> This is the fastest performance that I have tried so far.
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204
>
>