Date: Fri, 10 Sep 2010 15:43:06 -0400
Reply-To: Andy Arnold <awasas@COX.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Andy Arnold <awasas@COX.NET>
Subject: Hash - 6 Questions
Greetings, All.
I apologize if this is a duplicate post. I think I sent this earlier today,
but I can't find it on the list, so here it is again.
I considered making this into multiple posts, but decided to keep it all
together. The overall goal is to improve the performance of jobs that have
suffered from a recent, massive increase in data volume (from a few million
to 50 million and growing). I've already made significant reductions in run
time, but I need more. So hash tables are my next learning curve. (BTW,
the 50M records get exploded into 300M records, which then go through a Proc
Summary/nway with 6 classes.) I've done some test code and experiments, but
I still have a few questions about hash objects; so here they are.
Thanks for your reading time.
--Andy
1. How large is large? Many Hash Object discussions on the web indicate
that large datasets should see performance improvements when using Hash
Objects instead of Proc Summary or a dataset merge. I've converted 3 steps
(2 Summary and a merge) to a single step using 3 Objects, and the usage
numbers show little or no improvement. However, I do see a definite
improvement when a hash object replaces the Proc Summary/nway that reduces
50M records to 50K.
2. Do Hash Objects perform better with a single, large key than with a
composite key of comparable size? If so, is the improvement enough to
offset the 'cost' of concatenating the components and splitting them apart
again?
3. How do Hash Objects handle null/missing values in a component key? I
expect a completely missing key will fail to store. Will a 3-component key
store the item if the keys are A=1 B=missing C=3?
4. Does HITER get lost/confused when it REMOVEs an item from the Object? I
assume the answer is no, but I need to be told. I've found nothing on the
web indicating that I need to take special care, so I assume that after
removing the 4th item in the hash the HI_mail.next will look at the new 4th
item (old 5th). I know that I usually get lost and screw up the REMOVE
function when I write my own iterators.
5. Does REPLACE also do an ADD if the key/item is missing from the hash
object? Is this always true, or are there special cases?
6. Given that the Program Data Vector (PDV) and a hash object are separate
storage areas, are all of the following hash method descriptions correct?
Check: Verify that the item exists in the hash; PDV and hash are both
unchanged.
Find: Locate item in hash; if the item is found, copy its data fields to
the PDV, otherwise make no change to the PDV.
Add: Create item in the hash and copy PDV data to the item.
Replace: Locate item in the hash and copy PDV data to the item.
Remove: Locate the item in hash and remove its key and data from the hash.
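
For questions 3, 5 and 6, a small test along these lines should show the
behavior in the log. This is only a sketch: the names h, k, d, a, b, c, val,
rc, rc2 and n are made up for illustration, and the log output, not the
comments, is what settles each question.

```sas
/* Questions 5 and 6: watch the PDV variable d around each method call. */
data _null_;
   length k 8 d $8;
   declare hash h();
   h.defineKey('k');
   h.defineData('d');
   h.defineDone();

   k = 1; d = 'first';
   rc = h.add();                 /* rc=0 means the item was stored */
   put 'ADD:            ' rc= k= d=;

   d = 'pdv';
   rc = h.check();               /* compare d before and after the call */
   put 'CHECK:          ' rc= d=;

   rc = h.find();                /* key 1 is present */
   put 'FIND (present): ' rc= d=;

   k = 2; d = 'pdv';
   rc = h.find();                /* key 2 has not been added */
   put 'FIND (absent):  ' rc= d=;

   d = 'new';
   rc = h.replace();             /* question 5: key 2 is not in the hash yet */
   n = h.num_items;              /* item count after the REPLACE call */
   put 'REPLACE:        ' rc= n=;

   rc  = h.remove();             /* try to remove key 2 */
   rc2 = h.check();              /* then look for key 2 again */
   put 'REMOVE:         ' rc= rc2=;
run;

/* Question 3: composite key with one or all components missing. */
data _null_;
   length a b c 8 val $8;
   declare hash h();
   h.defineKey('a', 'b', 'c');
   h.defineData('val');
   h.defineDone();

   a = 1; b = .; c = 3; val = 'partial';
   rc = h.add();
   put 'One component missing:  ' rc=;

   call missing(a, b, c);
   val = 'allmiss';
   rc = h.add();
   put 'All components missing: ' rc=;

   n = h.num_items;              /* how many of the two items were stored */
   put 'Items stored: ' n=;
run;
```

A return code of 0 from each method indicates success, so comparing rc and
the printed values of d against the descriptions in question 6 should
confirm or refute each one.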