LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2008, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 4 Jun 2008 22:13:20 -0400
Reply-To:     msz03@albany.edu
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Mike Zdeb <msz03@ALBANY.EDU>
Subject:      Re: Question about efficient data extraction
Content-Type: text/plain;charset=iso-8859-1

hi ... I've read all the posts ... I modified the loops in the original job to make the data sets a bit larger ... 10,000 in COHORT that reduces to 9,944 after the NODUPKEY sort ... 30,005,555 in CLAIMS (after changing the loop limit to 10000000) ... using V9.1.3 on a relatively fast PC running XP ...

all methods produce 29,608 hits

#1: MERGE real time 6.29 seconds cpu time 6.28 seconds but you don't have to sort CLAIMS since it's generated in sorted order --- to be fair, the time for a sort of an unsorted CLAIMS data set would be a more realistic test

#2: HASH real time 8.21 seconds cpu time 8.21 seconds no sorts or addition of indexes are ever needed so no additional time for other tasks

#3: TABLE LOOKUP USING A FORMAT real time 18.23 seconds cpu time 18.23 seconds no sorts or addition of indexes are ever needed so no additional time for other tasks

#4: KEY INDEXING WITH A FORMAT real time 2.81 seconds cpu time 2.81 seconds no sorts or addition of indexes are ever needed so no additional time for other tasks

#5: SQL WITH AN INDEXED DATA SET (STILL THE FASTEST AS CITED BELOW) real time 0.15 seconds cpu time 0.15 seconds but I added the INDEX with PROC DATASETS and that also took time --- so what's the real time, 0.15 or 35 seconds ... NOTE: MODIFY was successful for WORK.CLAIMS.DATA. NOTE: PROCEDURE DATASETS used (Total process time): real time 33.57 seconds cpu time 23.00 seconds

I'm not sure if one of the recommended papers you read was ... "Getting Started with the DATA Step Hash Object" ...

http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf

if not, it ends with a nice discussion of the various techniques ... and as Muthia pointed out, Paul Dorfman's papers on KEY INDEXING and HASHING are wonderful

-- Mike Zdeb U@Albany School of Public Health One University Place Rensselaer, New York 12144-3456 P/518-402-6479 F/630-604-1475

> Thanks to Scott, others, and recommended papers, I have found a data step approach that is as > speedy as the SQL that I posted earlier. In an earlier response to Scott's post above, I stated > that it produced only one record per ID. However a slight modification gets all records. > > data test8; > set cohort; > do until(_iorc_ NE 0); > set claims key=id; > if _iorc_ EQ 0 then output; > end; > _error_ = 0; > run; > > NOTE: There were 1000 observations read from the data set WORK.COHORT. > NOTE: The data set WORK.TEST8 has 2943 observations and 3 variables. > NOTE: DATA statement used (Total process time): > real time 0.03 seconds > cpu time 0.03 seconds > > This is the fastest performance that I have tried so far. > > Dan > > Daniel J. Nordlund > Washington State Department of Social and Health Services > Planning, Performance, and Accountability > Research and Data Analysis Division > Olympia, WA 98504-5204 > >


Back to: Top of message | Previous page | Main SAS-L page