LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 1998, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 2 Dec 1998 20:53:56 -0500
Reply-To:   "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:   "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Subject:   Re: Matching Events
Content-Type:   text/plain; charset="iso-8859-1"

Alex Martchenko, in part, wrote:

>Here's the problem I need assistance with. A very large (order of 10 million) SAS dataset MASTER has >variables DATE and TIME and some additional info in a number of other vars but let's say there is only 1 >extra var MINFO. I also have a small dataset EVENTS (about 20000 obs) with only DATE and TIME. In this small >file (but _not_ in big) DATE values are all unique. In both files DATE may be anything from around 96 years >before now to today and TIME range is 8 a.m. to 5 p.m. I'm trying to come up with the most efficient way of >selecting only those events from the big file that are also on the small file (I expect maybe a few hundred >second-to-second matches). I've tried to merge them but with sorting, it takes too long. It wouldn't be that >bad if the big file was static but unfortunately it is refreshed daily and the match has to be run many >times a day against different small datasets. Any suggestions will be very much appreciated.

Alex,

In your situation, the most efficient way of matching is dictated by the nature of your data. Note that the match keys (the values of DATE) on your search file (EVENTS) satisfy two conditions:

1. They are numeric integers because they are SAS date values. 2. They fall into a limited range, as they embrace a limited time span.

Therefore, your case lends it ideally to the lookup method one could call "key-indexed search". That is, we can use the key values to index a lookup table large enough for all possible distinct key values. Under your circumstances, a temporary array with not less than, say, 36,500 elements (number of unique date values in a 100 year period) will more than suffice. Let us assume for simplicity that EVENTS contains only 5 records:

DATE TIME

-09131 40786 -01826 59262 +05479 54709 +09132 45610 +12784 56120

An array cannot be indexed by negative values but we can easily hash to positive ones simply by subtracting DATE from current date, DATE(). That yields:

DATE()-DATE TIME

23346 40786 16041 59262 08736 54709 05083 45610 01431 56120

Suppose we have declared an array

ARRAY DKEY(36500) _TEMPORARY_;

and stored the time value 40786 in the occurrence 23346, 59262 - in 16041 and so on. Imagine now that the current record from MASTER has DATE()-DATE = 08736. Then DKEY(08736) will tell us IMMEDIATELY that this value is found and the corresponding TIME value is 54709. If, on the other hand, DATE()-DATE coming from MASTER is, say, 14876, DKEY(14876) will tell us, also IMMEDIATELY, that search has failed since a missing value is returned. The entire search process thus amounts to a single array reference; no sort of any kind is needed; no key comparisons are involved. Expressing the above in SAS Language,

*** KEY-INDEXED MATCH ***;

DATA MATCHED (KEEP=DATE TIME MINFO); DT = DATE(); ARRAY DKEY (36500) _TEMPORARY_;

DO UNTIL(DEND); SET EVENTS END=DEND; DKEY (DT - DATE) = TIME; END;

DO UNTIL(MEND); SET MASTER END=MEND; IF TIME = DKEY (DT - DATE) THEN OUTPUT; END; RUN;

From the nature of key-indexed search, we can assert that within the domain of its applicability, no other lookup method will come even close in performance. Let us put this assertion to stress-test. Your data could be simulated as follows (everything is run on NT/333/128):

%LET EOBS = 20000; %LET MOBS = 10000000; DATA EVENTS (KEEP=DATE TIME) MASTER (KEEP=DATE TIME MINFO); DT = DATE(); DINCR = CEIL((DT - INTNX('YR',DT,-96))/&EOBS); DATE = DT; DO I=1 TO &EOBS; DATE = DATE - CEIL(RANUNI(1)*DINCR); TIME = 8*3600 + CEIL(RANUNI(2)*(17-8)*3600); OUTPUT EVENTS; END; DO I=1 TO &MOBS; DATE = DT - CEIL(RANUNI(3)*2*DT); TIME = 8*3600 + CEIL(RANUNI(4)*(17-8)*3600); MINFO = 'MINFO'; OUTPUT MASTER; END; RUN; PROC SORT DATA=EVENTS; BY TIME; RUN; * disorder by DATE;

Key-indexed match, as coded above, results in the following log messages:

NOTE: The data set WORK.MATCHED has 187 observations and 3 variables. NOTE: The DATA statement used 24.85 seconds.

I have chosen two other techniques - SQL and formatted match - to see how well they perform in comparison:

23 *** SQL MATCH ***; 24 25 PROC SQL; 26 CREATE TABLE SUBSET AS 27 SELECT M.DATE, M.TIME, MINFO 28 FROM EVENTS E, MASTER M 29 WHERE M.DATE = E.DATE AND M.TIME = E.TIME; 30 QUIT;

NOTE: Table WORK.SUBSET created, with 187 rows and 3 columns. NOTE: The PROCEDURE SQL used 1 minute 19.32 seconds.

32 *** FORMATTED MATCH ***; 33 34 DATA FMT (DROP=DATE TIME); 35 FMTNAME = 'DATES'; TYPE = 'I'; 36 DO UNTIL(DEND); 37 SET EVENTS END=DEND; 38 START = DATE; LABEL = TIME; 39 OUTPUT; 40 END; 41 LABEL = .; HLO = 'O'; 42 OUTPUT; 43 RUN;

NOTE: The data set WORK.FMT has 20001 observations and 5 variables. NOTE: The DATA statement used 0.67 seconds.

45 PROC FORMAT CNTLIN=FMT;

NOTE: The PROCEDURE FORMAT used 1.85 seconds.

47 DATA MATCHED (KEEP=DATE TIME MINFO); 48 SET MASTER; 49 IF TIME = INPUT(PUT(DATE,6.),DATES.); 50 RUN;

NOTE: The data set WORK.MATCHED has 187 observations and 3 variables. NOTE: The DATA statement used 3 minutes 12.46 seconds.

As we see, key-indexed match runs 3.18 times faster than SQL and 7.74 times faster than formatted match. Being surprised that SQL out-performed the relatively small format 2.4 times, I repeated the test under MVS and on both machines with lesser number of observations in MASTER; the disparity was less pronounced but the pattern still held. Go figure...

Conclusion: Use key-indexed search (match) whenever keys are integers falling into a limited range or can be easily hashed to this type of data. In the case of date and time (not datetime!) SAS values, the latter is always true. The "limited range" is essentially determined by the maximum size of a temporary array that can be allocated in a DATA step, that is by available memory. Given the performance superiority of key-indexed search over any other method, it may be worthwhile to attempt adapting long keys to this technique. Maybe, someone on the list (not afraid of hashing theory) would give it a thought?

Kind regards,

Paul +++++++++++++++++++++++ Paul M. Dorfman Citibank UCS ITS Jacksonville, FL +++++++++++++++++++++++


Back to: Top of message | Previous page | Main SAS-L page