| Date: | Wed, 2 Dec 1998 20:53:56 -0500 |
| Reply-To: | "Dorfman, Paul" <pdorfma@UCS.ATT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU> |
| From: | "Dorfman, Paul" <pdorfma@UCS.ATT.COM> |
| Subject: | Re: Matching Events |
| Content-Type: | text/plain; charset="iso-8859-1" |
|---|
Alex Martchenko, in part, wrote:
>Here's the problem I need assistance with. A very large (order of 10
million) SAS dataset MASTER has >variables DATE and TIME and some additional
info in a number of other vars but let's say there is only 1 >extra var
MINFO. I also have a small dataset EVENTS (about 20000 obs) with only DATE
and TIME. In this small >file (but _not_ in big) DATE values are all unique.
In both files DATE may be anything from around 96 years >before now to today
and TIME range is 8 a.m. to 5 p.m. I'm trying to come up with the most
efficient way of >selecting only those events from the big file that are
also on the small file (I expect maybe a few hundred >second-to-second
matches). I've tried to merge them but with sorting, it takes too long. It
wouldn't be that >bad if the big file was static but unfortunately it is
refreshed daily and the match has to be run many >times a day against
different small datasets. Any suggestions will be very much appreciated.
Alex,
In your situation, the most efficient way of matching is dictated by the
nature of your data. Note that the match keys (the values of DATE) on your
search file (EVENTS) satisfy two conditions:
1. They are numeric integers because they are SAS date values.
2. They fall into a limited range, as they embrace a limited time span.
Therefore, your case lends it ideally to the lookup method one could call
"key-indexed search". That is, we can use the key values to index a lookup
table large enough for all possible distinct key values. Under your
circumstances, a temporary array with not less than, say, 36,500 elements
(number of unique date values in a 100 year period) will more than suffice.
Let us assume for simplicity that EVENTS contains only 5 records:
DATE TIME
-09131 40786
-01826 59262
+05479 54709
+09132 45610
+12784 56120
An array cannot be indexed by negative values but we can easily hash to
positive ones simply by subtracting DATE from current date, DATE(). That
yields:
DATE()-DATE TIME
23346 40786
16041 59262
08736 54709
05083 45610
01431 56120
Suppose we have declared an array
ARRAY DKEY(36500) _TEMPORARY_;
and stored the time value 40786 in the occurrence 23346, 59262 - in 16041
and so on. Imagine now that the current record from MASTER has DATE()-DATE =
08736. Then DKEY(08736) will tell us IMMEDIATELY that this value is found
and the corresponding TIME value is 54709. If, on the other hand,
DATE()-DATE coming from MASTER is, say, 14876, DKEY(14876) will tell us,
also IMMEDIATELY, that search has failed since a missing value is returned.
The entire search process thus amounts to a single array reference; no sort
of any kind is needed; no key comparisons are involved. Expressing the above
in SAS Language,
*** KEY-INDEXED MATCH ***;
DATA MATCHED (KEEP=DATE TIME MINFO);
DT = DATE();
ARRAY DKEY (36500) _TEMPORARY_;
DO UNTIL(DEND);
SET EVENTS END=DEND;
DKEY (DT - DATE) = TIME;
END;
DO UNTIL(MEND);
SET MASTER END=MEND;
IF TIME = DKEY (DT - DATE) THEN OUTPUT;
END;
RUN;
From the nature of key-indexed search, we can assert that within the domain
of its applicability, no other lookup method will come even close in
performance. Let us put this assertion to stress-test. Your data could be
simulated as follows (everything is run on NT/333/128):
%LET EOBS = 20000;
%LET MOBS = 10000000;
DATA EVENTS (KEEP=DATE TIME) MASTER (KEEP=DATE TIME MINFO);
DT = DATE();
DINCR = CEIL((DT - INTNX('YR',DT,-96))/&EOBS);
DATE = DT;
DO I=1 TO &EOBS;
DATE = DATE - CEIL(RANUNI(1)*DINCR);
TIME = 8*3600 + CEIL(RANUNI(2)*(17-8)*3600);
OUTPUT EVENTS;
END;
DO I=1 TO &MOBS;
DATE = DT - CEIL(RANUNI(3)*2*DT);
TIME = 8*3600 + CEIL(RANUNI(4)*(17-8)*3600);
MINFO = 'MINFO';
OUTPUT MASTER;
END;
RUN;
PROC SORT DATA=EVENTS; BY TIME; RUN; * disorder by DATE;
Key-indexed match, as coded above, results in the following log messages:
NOTE: The data set WORK.MATCHED has 187 observations and 3 variables.
NOTE: The DATA statement used 24.85 seconds.
I have chosen two other techniques - SQL and formatted match - to see how
well they perform in comparison:
23 *** SQL MATCH ***;
24
25 PROC SQL;
26 CREATE TABLE SUBSET AS
27 SELECT M.DATE, M.TIME, MINFO
28 FROM EVENTS E, MASTER M
29 WHERE M.DATE = E.DATE AND M.TIME = E.TIME;
30 QUIT;
NOTE: Table WORK.SUBSET created, with 187 rows and 3 columns.
NOTE: The PROCEDURE SQL used 1 minute 19.32 seconds.
32 *** FORMATTED MATCH ***;
33
34 DATA FMT (DROP=DATE TIME);
35 FMTNAME = 'DATES'; TYPE = 'I';
36 DO UNTIL(DEND);
37 SET EVENTS END=DEND;
38 START = DATE; LABEL = TIME;
39 OUTPUT;
40 END;
41 LABEL = .; HLO = 'O';
42 OUTPUT;
43 RUN;
NOTE: The data set WORK.FMT has 20001 observations and 5 variables.
NOTE: The DATA statement used 0.67 seconds.
45 PROC FORMAT CNTLIN=FMT;
NOTE: The PROCEDURE FORMAT used 1.85 seconds.
47 DATA MATCHED (KEEP=DATE TIME MINFO);
48 SET MASTER;
49 IF TIME = INPUT(PUT(DATE,6.),DATES.);
50 RUN;
NOTE: The data set WORK.MATCHED has 187 observations and 3 variables.
NOTE: The DATA statement used 3 minutes 12.46 seconds.
As we see, key-indexed match runs 3.18 times faster than SQL and 7.74 times
faster than formatted match. Being surprised that SQL out-performed the
relatively small format 2.4 times, I repeated the test under MVS and on both
machines with lesser number of observations in MASTER; the disparity was
less pronounced but the pattern still held. Go figure...
Conclusion: Use key-indexed search (match) whenever keys are integers
falling into a limited range or can be easily hashed to this type of data.
In the case of date and time (not datetime!) SAS values, the latter is
always true. The "limited range" is essentially determined by the maximum
size of a temporary array that can be allocated in a DATA step, that is by
available memory. Given the performance superiority of key-indexed search
over any other method, it may be worthwhile to attempt adapting long keys to
this technique. Maybe, someone on the list (not afraid of hashing theory)
would give it a thought?
Kind regards,
Paul
+++++++++++++++++++++++
Paul M. Dorfman
Citibank UCS ITS
Jacksonville, FL
+++++++++++++++++++++++
|