LISTSERV at the University of Georgia
Date:         Wed, 26 Mar 2003 03:41:58 -0500
Reply-To:     Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Subject:      Hash on two variables - big datasets, small disk space.
In-Reply-To:  <200303250340.h2P3eJ907594@pasta.cc.uga.edu>
Content-Type: TEXT/PLAIN; charset=US-ASCII

Hello All,

I am merging two files. The first contains patient ID, (geocoded) patient address, and (geocoded) doctor address. The second is a dataset I have created with the road distance from each possible patient address to each possible doctor office/hospital.

The patient file has ~2 million records. The address look-up table has closer to 3 million unique records. I've eliminated duplicate records, minimized the length of key values, and compressed the datasets.

The merge runs fine, given sufficient resources.

I have now been requested to run a similar analysis on a dataset of > 100 million records. Work space and CPU time are issues in this environment, so I would like to minimize both.

I am considering using Paul Dorfman's hashing methodologies. What I have not yet found is a reference for optimal methods for data step hashing based on two keys. Note: One of the keys (doctor_geocode) has ~20% fewer values, so that should help speed up the linkage.
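For illustration only, the kind of two-key lookup in question can be sketched as follows. This is a minimal, language-neutral sketch in Python, not Paul Dorfman's code and not SAS v6 syntax. The names (`key_patient`, `key_doctor`, `distance`, `slot_for`, `load`, `lookup`), the integer geocodes, the multiplier used to combine the keys, and the tiny table size are all assumptions made for the example. The parallel lists stand in for what would be temporary arrays in a data step, and collisions are resolved by linear probing.

```python
# Two-key lookup via an open-addressing hash table held in parallel arrays.
# All names and constants here are hypothetical, chosen for illustration.

TABLE_SIZE = 11  # a prime larger than the number of lookup rows (tiny here)

# Parallel arrays: one slot per possible table position.
key_patient = [None] * TABLE_SIZE
key_doctor = [None] * TABLE_SIZE
distance = [None] * TABLE_SIZE


def slot_for(patient_geo, doctor_geo):
    """Return the slot holding this key pair, or the empty slot where it belongs."""
    # Combine both keys into one integer, then reduce modulo the table size.
    h = (patient_geo * 1_000_003 + doctor_geo) % TABLE_SIZE
    # Linear probing: step forward until the pair or an empty slot is found.
    # This terminates only while the table has at least one empty slot.
    while key_patient[h] is not None and (
        key_patient[h] != patient_geo or key_doctor[h] != doctor_geo
    ):
        h = (h + 1) % TABLE_SIZE
    return h


def load(rows):
    """Insert (patient_geo, doctor_geo, distance) rows into the table."""
    for patient_geo, doctor_geo, km in rows:
        h = slot_for(patient_geo, doctor_geo)
        key_patient[h], key_doctor[h], distance[h] = patient_geo, doctor_geo, km


def lookup(patient_geo, doctor_geo):
    """Return the stored distance for the pair, or None if the pair is absent."""
    return distance[slot_for(patient_geo, doctor_geo)]
```

In this sketch the look-up table would be loaded once, and the large patient file would then be read sequentially with one `lookup` call per record, so neither file needs to be sorted. Whether a table of this size fits in memory on a given installation is a separate question that the sketch does not address.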

SAS v9 has a data step based hash (sounds like they have been paying attention to Paul!) but this installation is v6.

Thanks,
--Shane

Shane Hornibrook
Mobile: (902)441-4158
shane_sasl_nospam1@shanehornibrook.com
http://www.shanehornibrook.com/

