LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2009, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 1 Dec 2009 16:12:38 -0500
Reply-To:     "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Subject:      Re: Hashing with SAS9.2 using multiple hash keys and a date range
              (Revisited)
Comments: To: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
In-Reply-To:  <2fc7f3340912010658s23385d03k239628f6f9cebbb3@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"

SAS-L: Aready had too much hash? Then skip this email.

Muthia:

You presented a very interesting and informative comparison test.

I replicated your results (on my machine 18.8 seconds for my code and 0.5 for your MK1), and dug in a little deeper to understand why. I the dramatic difference between your code and my suggestion is attributable to 3 issues:

1. A suboptimal choice on my part for hashexp parameter in the DECLARE HASH statement (when I changed it from 16 to 24, CPU time and elapsed time dropped by 33%).

Take home message: HASHEXP can matter.

2. Scale of the dataset. My program has a fixed cost of building the hash table (probably about 60 seconds on your platform) wheras yours will have virtually no hash build time.

But if the part following the hash table build (i.e. reading through Saleshistory) were made to process 50 million records (as per the OP's situation) vs. your sample of 1 million, the relative performance would be closer. I did such a test and the ratio then dropped to 3:1 (75 seconds for my program and 23 for your MK1 progam).

Take home message: To get the advantage of hash lookup, you have to process enough data to make up for the time required to make the hash table.

3. Hash lookup isn't close enough to exactly O(1) to always be superior in this case. In fact, Paul Dorfman has told us that it should be close to O(log2(N/M)), where N is number of keys M is number of buckets.

To check this, I changed your CLIENTDISCOUNT dataset to have an average date range of 500 days (as opposed your average of 5,000), resulting in about one-tenth the original number of keys. A retest with the smaller hash table showed a 50% drop in the time required to process the hash (i.e. after controlling for hash building). True O(1) performance would yield no savings, but O(log2(N/M) in my case would generate maybe 50% savings on the retrieval side, which it did.

Even with all this, your MK1 program was still 23 cpu seconds, vs. 40 seconds in my tweaked program (including both hash creation and hash utilization).

Only when I reduced the average date range in CLIENTDISCOUNT to 50 days (i.e. about 50 items per clientid, instead of the original 5000) did the cpu and elapsed time for my program match yours.

Bottom line: Building a really big hash table? Don't do it merely to replace an "IF A <= x <= C" statement.

Or put in other words: the "if A <= x <= C" statement is not sensitive to the difference of C minus A, but making and using a hash table of lookup values for a grid between A and C is.

Thanks, Mark

.. and now back to work. MK

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Muthia Kachirayan > Sent: Tuesday, December 01, 2009 9:59 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Hashing with SAS9.2 using multiple hash keys and a date > range (Revisited) > > SAS-Lers, > > Mark gave a solution with the conjecture that his solution may be > faster than other 3 solutions though it takes more memory to store the > hash table. I too thought that the lookup time should be small as > there is a direct hit. I have tested this hypothesis with a small data > set given below and used the 4 solutions. Karma's solution failed > after running sometime with the message: > > "Hash object added **** items when memory failure occurred." > > I compared then 3 solutions by - Mark, MK_9.1 and MK_9.2. To minimize > the I/O operations, the data _null_ statement is used. The units of > time taken by these are 68.01, 1.03 and 1.00 respectively. > Surprisingly, the 9.1 solution is as good as 9.2. > > The data sets used is generated by the code: > > > > **** Generating Data Sets *******; > data saleshistory(keep = clientId productCode dateOfSale > fullAmtPayable) > clientdiscount(keep = clientId fromDate toDate clientDiscount); > length clientid $8 productCode $2 ; > do i = 1 to 1e6; > k = ceil(ranuni(123) * 1e6); > clientid = 'C' || put(k,z7.); > productCode = byte(65 + ranuni(123) * 25) || byte(65 + ranuni(123) * > 25) ; > dateOfSale = ceil(ranuni(123) * 1e6); > fromDate = dateOfSale - ceil(ranuni(123) * 5000); > toDate = dateOfSale + ceil(ranuni(123) * 5000); > fullAmtPayable = ceil(ranuni(123) * 5000); > clientDiscount = round(ranuni(123)/10,.01); > output saleshistory; > if ranuni(123) < 5e-3 then output clientdiscount; > end; > run; > > *** Mark's Code ***; > > data _null_; > if 0 then set clientdiscount saleshistory; > > declare hash hdisc (hashexp:16); > hdisc.definekey('clientid','dateofsale'); > hdisc.definedata('clientdiscount'); > hdisc.definedone(); > > ** Fill the lookup table **; > do until (end_of_discounts); > set clientdiscount end=end_of_discounts; > if todate = . then todate = today(); > do dateofsale = fromdate to todate; > _rc=hdisc.add(); > end; > end; > > ** Use hash lookup (keyed on clientid/dateofsale) to get discounts **; > do until (end_of_sales); > set salesHistory end=end_of_sales; > clientdiscount=.; > _rc = hdisc.find() ; > output; > end; > > drop fromdate todate _rc; > run; > > *** MK_9.1 Code ***; > > data _null_; > if _n_ = 1 then do; > if 0 then set clientDiscount; > declare hash h(); > h.definekey('id','clientid'); > h.definedata('fromdate','todate','clientDiscount') ; > h.definedone(); > do until(z); > set clientDiscount end = z; > if todate = . then todate = 1e6; > do id = 1 by 1 until(h.check() ne 0); > end; > h.add(); > end; > end; > set salesHistory ; > do id = 1 by 1 while(h.find() = 0 and not _in_range); > _in_range = fromdate <= dateofsale <= todate; > if _in_range then output; > end; > if not _in_range then do; clientDiscount = .; output; end; > drop _:; > run; > > *** MK_9.2 Code ***; > > data _null_; > if _n_ = 1 then do; > if 0 then set clientDiscount; > declare hash h(multidata:'y'); > h.definekey('clientid'); > h.definedata('fromdate','todate','clientDiscount') ; > h.definedone(); > do until(z); > set clientDiscount end = z; > if todate = . then todate = 1e6; > h.add(); > end; > end; > set salesHistory ; > _rc = h.find(); > _in_range = fromdate <= dateofsale <= todate; > h.has_next(result:_r); > do while(_r ne 0 and not _in_range); > h.find_next(); > _in_range = fromdate <= dateofsale <= todate; > h.has_next(result:_r) ; > end; > if not _in_range then clientDiscount = .; > drop _:; > run; > > Muthia Kachirayan > > > On 11/30/09, Keintz, H. Mark <mkeintz@wharton.upenn.edu> wrote: > > Rob: > > > > While karma responded with a nice demonstration of how to solve your > problem > > using the duplicate key hash feature in SAS 9.2, I'm not sure you > need, or > > even want the feature in this situation. > > > > After all this is a case of 50 million sales against 10,000 client > > discounts, with each discount record having its own date range. I > suspect > > it might be more efficient to key the client discount on both > CLIENTID AND > > DATEOFSALE, where DATEOFSALE is set to range from FROMDATE to TODATE. > > > > Let's say that the average FROMDATE-TODATE range is 4 years ==> > (about 1460 > > days including weekends). Then admittedly you would have to build a > hash > > table with 14,600,000 items instead of 10,000, (8 seconds vs 0.03 > seconds in > > my case). But once completed, the lookup speed is O(1), right? So > the main > > cost is memory. > > > > And what's the benefit? Avoiding 50,000,000 iterations of > > IF FROMDATE <= <= DATEOFSALE <= TODATE > > and avoiding the logic required to check for duplicate keys. > > > > > > I'd be interested in whether something like this (untested) code > provides > > much improvement in speed. No doubt if your date range of sales is > long > > enough, then the memory burden of the hash table overcomes the > performance > > improvement of a single lookup, and you would find duplicate keys > (with its > > smaller memory footprint) to be faster. > > > > > > data want; > > if 0 then set clientdiscount saleshistory; > > > > declare hash hdisc (hashexp:16); > > hdisc.definekey('clientid','dateofsale'); > > hdisc.definedata('clientdiscount'); > > hdisc.definedone(); > > > > ** Fill the lookup table **; > > do until (end_of_discounts); > > set clientdiscount end=end_of_discounts; > > do dateofsale = fromdate to todate; > > _rc=hdisc.add(); > > end; > > end; > > > > ** Use hash lookup (keyed on clientid/dateofsale) to get discounts > **: > > do until (end_of_sales); > > set salesHistory end=end_of_sales; > > clientdiscount=.; > > _rc = hdisc.find() ; > > output; > > end; > > > > drop fromdate todate _rc; > > run; > > > > > > Regards, > > Mark > > > > > >> -----Original Message----- > >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > >> rob > >> Sent: Sunday, November 29, 2009 9:42 PM > >> To: SAS-L@LISTSERV.UGA.EDU > >> Subject: Hashing with SAS9.2 using multiple hash keys and a date > range > >> (Revisited) > >> > >> About 6 months ago I posted a message to this group enquiring about > >> hashing with a date range. > >> At the time we were running on SAS9.1.3. We ended up building a > >> temporary solution using one of the suggestions made by one of the > >> SAS_L repondents. The solution worked well and involved building a > >> format. > >> > >> We are now running on SAS9.2. We are embarking on some work which > >> means that the proc format > >> approach we employed is no longer suitable. > >> > >> Does anyone have any examples of how we can build a hash with > >> duplicate key values? The correct key to use being determined by two > >> date columns which provide the range. > >> > >> My origional posting (below) gives an example of the task. > >> > >> regards, > >> > >> rob ashmore > >> > >> > >> > >> > >> ****ORIGINAL POST START******* > >> Dear all, > >> > >> I wish to perform a many to many merge using a date range. > >> I could do this several ways using proc sort / data step or > >> perhaps even sql. > >> > >> > >> However, due to its large size, I dont want to sort the base > dataset. > >> As for the sql possibilites, I am not too sure about them. > >> > >> > >> I have supplied some example sas code below to illustrate what I am > >> trying to do. > >> > >> > >> I have two datasets saleshistory (50+ million rows) and > >> clientDiscountDtls (10,000 rows). > >> Basically, the task is to add the column clientDiscount to every row > >> of the salesHistory > >> dataset. The common key is clientId. However, the clientDiscount > >> should only be populated > >> if the dateOfSale falls between the fromDate and toDate of the > >> clientDiscountDtls dataset. > >> If there is no corresponding date range, the value of clientDiscount > >> is set to missing. > >> If there is no matching clientId in the clientDiscountDtls table > then > >> the record is still > >> outputted to the results dataset but the the value of > clientDiscount > >> is set to missing. > >> > >> > >> I am hoping that the above can be achieved using a hash because I > >> need > >> it to run pretty > >> quickly and repeatedly. > >> > >> > >> I have done a little research and have found that hashing using SAS > >> 9.1 will probably not be > >> useful for doing a many to many merge. However, we will soon be > >> moving > >> to SAS 9.2 and > >> I believe there are some new features in the hash data step > component > >> object which may > >> allow what I am trying to do. > >> > >> > >> Could anyone help by showing me how I can achieve this using a hash, > >> either 9.1 or 9.2. > >> Also, in the interests of my own education, Id be keen to see any > >> example of how to do this using > >> proc sql. I suspect it may well be quite straightforward? > >> > >> > >> *Sample Datsets*; > >> > >> > >> data work.salesHistory; > >> infile cards ; > >> length clientId $4 > >> productCode $2 > >> dateOfSale 8 > >> fullAmtPayable 8 > >> ;;; > >> > >> > >> input clientId $ > >> productCode $ > >> dateOfSale date9. > >> fullAmtPayable ; > >> cards; > >> C001 DX 22jan2001 100 > >> C002 KR 22mar2006 230 > >> C001 EZ 20sep2008 400 > >> C001 DX 25may2004 250 > >> C003 EK 15nov2008 300 > >> C002 MG 10feb2003 50 > >> ; > >> run; > >> > >> > >> data work.clientDiscount; > >> format fromdate todate date9.; > >> infile cards ; > >> length clientId $4 > >> fromDate 8 > >> toDate 8 > >> clientDiscount 8 > >> ;;; > >> > >> > >> input @1 clientId $ > >> @6 fromDate date7. > >> @15 toDate date7. > >> clientDiscount 8.2 > >> ; > >> cards; > >> C001 26dec00 01may04 0.1 > >> C001 01aug04 15oct07 0.25 > >> C001 16oct07 . 0.05 > >> C002 24feb03 01jan09 0.075 > >> ; > >> run; > >> > >> > >> **Desired Result Dataset**; > >> clientid / productcode / dateOfSale / fullAmtPayable / > clientDiscount > >> C001 DX 22jan2001 100 0.10 > >> C002 KR 22mar2006 230 0.075 > >> C001 EZ 20sep2008 400 0.05 > >> C001 DX 25may2004 250 . > >> C003 EK 15nov2008 300 . > >> C002 MG 10apr2003 50 . > >> > >> > >> etc etc > >> > >> > >> Thanks In Advance, > >> > >> > >> Rob Ashmore > >> > >> ****ORIGINAL POST END******* > >


Back to: Top of message | Previous page | Main SAS-L page