Date: Wed, 26 Mar 2003 03:41:58 -0500
Reply-To: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Subject: Hash on two variables - big datasets, small disk space.
In-Reply-To: <200303250340.h2P3eJ907594@pasta.cc.uga.edu>
Hello All,
I am merging two files. The first has patient ID, (geocoded) patient
address, and (geocoded) doctor address. The second is a look-up dataset I
created with the road distance from each possible patient address to each
possible doctor office/hospital.
The patient file has ~2 million records. The address look-up table has
closer to 3 million unique records. I've eliminated duplicate records,
minimized the length of key values, and compressed the datasets.
The merge runs fine, given sufficient resources.
I have now been asked to run a similar analysis on a dataset of more than
100 million records. Work space and CPU time are both limited in this
environment, so I would like to minimize both.
I am considering using Paul Dorfman's hashing methodologies. What I
have not yet found is a reference for optimal methods for data step
hashing based on two keys. Note: One of the keys (doctor_geocode) has ~20%
fewer values, so that should help speed up the linkage.
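For illustration only, here is a minimal sketch of a two-key hashed look-up of the kind described above. It is written in Python rather than SAS, and it is not Paul Dorfman's code: the function names, the multiplier 1_000_003, the table size, and the sample geocodes and distances are all hypothetical. It assumes the keys are non-negative integers and that the table has more slots than look-up rows. The two keys are folded into one number, reduced modulo the table size, and collisions are resolved by linear probing in preallocated arrays.

```python
# Sketch only: open-addressing hash table keyed on the pair
# (patient_geocode, doctor_geocode). All names and values are hypothetical.

def find_slot(keys, pat, doc):
    """Return the slot holding (pat, doc), or the empty slot where it would go.

    Assumes the table is never completely full, otherwise this loops forever.
    """
    hsize = len(keys)
    # Fold the two keys into a single number, then reduce to a slot index.
    h = (pat * 1_000_003 + doc) % hsize
    # Linear probing: step forward until we hit the pair or an empty slot.
    while keys[h] is not None and keys[h] != (pat, doc):
        h = (h + 1) % hsize
    return h

def build_table(rows, hsize):
    """Load (patient_geocode, doctor_geocode, distance) rows into two arrays."""
    keys = [None] * hsize   # stored key pairs
    dist = [None] * hsize   # stored road distances, parallel to keys
    for pat, doc, km in rows:
        h = find_slot(keys, pat, doc)
        keys[h] = (pat, doc)
        dist[h] = km
    return keys, dist

def lookup(keys, dist, pat, doc):
    """Return the stored distance for the pair, or None if it is absent."""
    h = find_slot(keys, pat, doc)
    return dist[h] if keys[h] is not None else None

# Hypothetical look-up rows: (patient_geocode, doctor_geocode, road distance).
lookup_rows = [(101, 7, 12.4), (101, 9, 3.1), (202, 7, 25.0)]
keys, dist = build_table(lookup_rows, 11)  # prime size, larger than row count

# Hypothetical patient records: (patient ID, patient_geocode, doctor_geocode).
patients = [("A", 101, 9), ("B", 202, 7), ("C", 303, 7)]
merged = [(pid, lookup(keys, dist, pat, doc)) for pid, pat, doc in patients]
```

The point of the sketch is that the look-up table is loaded into memory once and the large patient file is then read in a single sequential pass, with no sort of either file. Whether a 3-million-row table fits in memory on a given installation is a separate question that the sketch does not address.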
SAS v9 has a built-in DATA step hash (sounds like they have been paying
attention to Paul!), but this installation is v6.
Thanks,
--Shane
Shane Hornibrook
Mobile: (902)441-4158
shane_sasl_nospam1@shanehornibrook.com
http://www.shanehornibrook.com/