Date: Fri, 28 Mar 2003 00:03:06 -0500
Reply-To: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Shane Hornibrook <shane_sasl_nospam1@SHANEHORNIBROOK.COM>
Subject: Re: Hash on two variables - big datasets, small disk space.
Content-Type: TEXT/PLAIN; charset=US-ASCII
Combining the two keys into one unique key is a good idea for the dataset
that has numeric keys. Thanks!
In one of the datasets I am fuzzy matching one of the keys (it is a
string), so combining them would just complicate matters for those keys.
I like the technique regarding the hierarchical searching of the index.
If I understand correctly, you're also suggesting I offload the actual
distance to a different table, and use a replacement binary number in the
hash that gets loaded to memory? That would certainly help with memory
issues! I can even afford to simplify the distances (e.g. it is important
to differentiate between 1.4 and 1.5 km, but not as important to
differentiate between 273 and 275 km ... for this dataset).
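That kind of distance simplification can be sketched as follows (Python for illustration; the bin widths are hypothetical, not from this thread):

```python
# Illustrative sketch of quantizing distances so that many geocode pairs
# share one small integer code. Bin widths are made up: fine resolution
# for short distances, coarse resolution for long ones.
def distance_code(km):
    """Map a road distance in km to a small integer bucket."""
    if km < 10:
        return int(km * 10)          # 0.1 km resolution below 10 km
    if km < 100:
        return 100 + int(km)         # 1 km resolution from 10 to 100 km
    return 200 + int(km / 10)        # 10 km resolution beyond 100 km

# 1.4 km and 1.5 km remain distinct; 273 km and 275 km collapse to one code.
```

Keeping only the small integer code in the in-memory index, with the code-to-distance table on disk, is what cuts the memory footprint.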
Thanks for the excellent suggestions.
On Wed, 26 Mar 2003, Sigurd Hermansen wrote:
> I have tried most of the possible indexing schemes involving Dorfman
> indexes. Paul originally provided a set of SAS macros (%hsize, %hload,
> %hsearch), DOW loops that load indexes from one dataset and search indexes
> while scanning another dataset. Paul has also defined binary variables that
> compress data into minimal variable lengths.
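The "minimal variable length" idea can be illustrated outside SAS as well; a minimal Python sketch (the helper name is mine, not from Paul's macros):

```python
# Sketch (not Paul's actual macro code): store an integer key in the
# fewest bytes that can hold it, rather than a full 8-byte numeric.
def min_bytes(value):
    """Smallest big-endian byte string holding a non-negative integer."""
    return value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")
```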
> I suspect that you have a bandwidth problem and a memory limit problem.
> Under Dorfman hash indexes, the index only needs memory sufficient to hold
> the values being loaded in the index, plus a relatively small amount of
> memory to hold hash value conflicts. Nonetheless, each additional bit of
> index value length is multiplied by the number of rows being indexed. The index
> values in 80 million rows of data have to go somewhere. (More on that
> later.) For a direct implementation of a hash index, you'll need
> 80M X index value length
> bytes of memory space to hold the index.
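That arithmetic is easy to sanity-check with a quick sketch (the value lengths below are examples only):

```python
# Back-of-envelope check of the "80M X index value length" figure above.
rows = 80_000_000
for value_len in (4, 6, 8):              # example bytes per index value
    mb = rows * value_len / 2**20
    print(f"{value_len}-byte values: about {mb:,.0f} MB")
    # roughly 305, 458, and 610 MB respectively
```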
> While you have suggested that you have two variables to index, you actually
> have only one: most likely a binary number composed from the two geocodes.
> Again, I'll have to defer to Paul's expertise in composition of values into
> a single index. At the logical level, it makes sense to combine the two
> geocodes since distance is a function of the two geocode arguments. Looking
> up one geocode and the other independently will tell you nothing about the
> distance between the two.
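One way to compose such a single key is sketched below (Python for illustration; the bound is an assumption, not a figure from the thread):

```python
# Sketch of composing two geocodes into one invertible integer key.
# DOCTOR_BOUND is hypothetical; it only needs to exceed the largest
# doctor geocode ID so that no two pairs collide.
DOCTOR_BOUND = 2**22

def composite_key(patient_geo, doctor_geo):
    """One integer per (patient, doctor) geocode pair."""
    return patient_geo * DOCTOR_BOUND + doctor_geo

def split_key(key):
    """Recover the original pair from a composite key."""
    return divmod(key, DOCTOR_BOUND)
```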
> Now back to the size of the index. Since you are using the index as a
> look-up table, you may be able to reduce the number of values that need to
> be indexed. If you group the (geocode1,geocode2) combinations that map to
> approximately the same distance, you may find that you have large
> numbers of geocode combinations mapping to the same distance. If so, it
> would make sense to set up a hierarchical indexing scheme. First, assign a
> unique binary number to each element in each set of geocode
> combinations mapped to the same distance. Then, create a hash index of the
> binary numbers, and load the distances corresponding to the binary numbers
> into the index. After loading the index, proceed to scan the dataset
> containing the geocodes. For each row in the dataset, link the combination
> of geocodes to the appropriate binary number. Next, look up the distance on
> the hash index using the binary number. A direct access link to a binary
> number and hash indexing of the binary number should work very efficiently.
> The loading of the index function that maps geocode combinations to unique
> binary numbers has to be done only once for each set of geocode
> combinations. The index can be stored as a permanent SAS dataset and updated
> when necessary. The scan of a 100M-row file works very efficiently in SAS.
> The direct access and hash indexing should take little more than the time
> required to scan the file.
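The whole scheme can be sketched in a few lines (Python for clarity, since the thread targets a SAS v6 DATA step; the geocodes, codes, and distances below are made up):

```python
# Step 1 (done once per set of geocode combinations): map each
# (geocode1, geocode2) combination to a small code, and each code to its
# distance. A Python dict plays the role of the hash index here.
pair_to_code = {(101, 9001): 0, (101, 9002): 0, (102, 9001): 1}
code_to_distance = {0: 1.4, 1: 273.0}    # one stored distance per code

# Step 2: while scanning the big file, two cheap look-ups per row
# replace a merge against the full distance table.
def lookup_distance(g1, g2):
    code = pair_to_code[(g1, g2)]        # geocode pair -> binary number
    return code_to_distance[code]        # binary number -> distance
```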
> I recommend that you talk to Paul about checking to see whether a
> hierarchical index scheme fits your data, and about the details of an
> implementation. He may be able to engineer a quick and efficient solution to
> what seems at first an overwhelming problem.
> -----Original Message-----
> From: Shane Hornibrook [mailto:shane_sasl_nospam1@SHANEHORNIBROOK.COM]
> Sent: Wednesday, March 26, 2003 3:42 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Hash on two variables - big datasets, small disk space.
> Hello All,
> I am merging two files: the first with patient ID, (geocoded) patient
> address, and (geocoded) doctor address. The second is a dataset I have
> created with road distance from each possible patient address to each
> possible doctor office/hospital.
> The patient file has ~2 million records. The address look-up table has
> closer to 3 million unique records. I've eliminated duplicate records,
> minimized the length of key values, and compressed the datasets.
> The merge runs fine, given sufficient resources.
> I have now been requested to run a similar analysis on a dataset of > 100
> million records. Work space and CPU time are issues in this environment, so
> I would like to minimize both.
> I am considering using Paul Dorfman's hashing methodologies. What I
> have not yet found is a reference for optimal methods for data step
> hashing based on two keys. Note: One of the keys (doctor_geocode) has ~20%
> fewer values, so that should help speed up the linkage.
> SAS v9 has a DATA step-based hash (sounds like they have been paying
> attention to Paul!) but this installation is v6.
> Shane Hornibrook
> Mobile: (902)441-4158