Date: Wed, 26 Mar 2003 03:41:58 -0500
Reply-To: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Subject: Hash on two variables - big datasets, small disk space.
Content-Type: TEXT/PLAIN; charset=US-ASCII
I am merging two files: the first with patient ID, (geocoded) patient
address, and (geocoded) doctor address. The second is a dataset I have
created with road distance from each possible patient address to each
possible doctor office/hospital.
The patient file has ~2 million records. The address look-up table has
closer to 3 million unique records. I've eliminated duplicate records,
minimized the length of key values, and compressed the datasets.
The merge runs fine, given sufficient resources.
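For reference, the current approach is essentially the following (all
dataset and variable names are illustrative stand-ins for the real ones):

   /* Look-up table: one row per (patient, doctor) geocode pair.  */
   /* NODUPKEY drops duplicate key pairs; COMPRESS= saves disk.   */
   proc sort data=distances out=distances (compress=yes) nodupkey;
      by patgeo docgeo;
   run;

   proc sort data=patients;
      by patgeo docgeo;
   run;

   /* Match-merge: attach road distance to every patient record. */
   data linked;
      merge patients (in=inpat) distances;
      by patgeo docgeo;
      if inpat;   /* keep every patient record, matched or not */
   run;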
I have now been asked to run a similar analysis on a dataset of > 100
million records. Work space and CPU time are at a premium in this
environment, so I would like to minimize both.
I am considering using Paul Dorfman's hashing methodologies. What I
have not yet found is a reference for optimal methods for data step
hashing based on two keys. Note: one of the keys (doctor_geocode) has ~20%
fewer distinct values than the other, which should help speed up the linkage.
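The best idea I have so far is to pack the two keys into a single composite
key and hash that. Here is a rough sketch of what I mean, along the lines of
Paul's array-based technique. The names, the packing factor, and the table
size are all made up; it assumes the geocodes are numeric, that docgeo is
always below 1e7, and that the packed key stays below 2**53 so it remains an
exact integer:

   %let hsize = 6000011;  /* > number of look-up rows; a prime spreads keys better */

   data linked;
      array hkey  (0:%eval(&hsize - 1)) _temporary_;  /* packed keys    */
      array hdist (0:%eval(&hsize - 1)) _temporary_;  /* road distances */

      /* Load the look-up table once, resolving collisions by   */
      /* linear probing (terminates because hsize > table rows). */
      if _n_ = 1 then do until (eof1);
         set distances end=eof1;
         _k = patgeo * 1e7 + docgeo;   /* pack both keys into one */
         _h = mod(_k, &hsize);
         do while (hkey(_h) ne . and hkey(_h) ne _k);
            _h = mod(_h + 1, &hsize);  /* step to next open slot  */
         end;
         hkey(_h)  = _k;
         hdist(_h) = distance;
      end;

      /* Probe once per patient record. */
      set patients;
      _k = patgeo * 1e7 + docgeo;
      _h = mod(_k, &hsize);
      do while (hkey(_h) ne . and hkey(_h) ne _k);
         _h = mod(_h + 1, &hsize);
      end;
      if hkey(_h) = _k then distance = hdist(_h);
      else distance = .;               /* no matching pair        */
      drop _k _h;
   run;

The trade-off is that both _TEMPORARY_ arrays must fit in memory (about 16
bytes per slot at 8 bytes per numeric), in exchange for skipping both sorts
and the merge entirely. What I don't know is whether packing the keys this
way is the optimal approach, hence the question.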
SAS v9 has a data step hash object (sounds like they have been paying
attention to Paul!), but this installation is v6.
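For the record, my understanding of the v9 syntax (from the documentation
only; untested here, and using the same illustrative names as above) is that
the hash object takes two keys directly:

   data linked;
      if _n_ = 1 then do;
         if 0 then set distances;          /* put DISTANCE etc. in the PDV */
         declare hash d (dataset: 'distances');
         d.definekey('patgeo', 'docgeo');  /* two keys, no packing needed  */
         d.definedata('distance');
         d.definedone();
      end;
      set patients;
      if d.find() ne 0 then distance = .;  /* no match: set missing        */
   run;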