LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2009, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 29 Sep 2009 19:36:24 -0400
Reply-To:     Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject:      Re: Cross-Summing problem
Comments: To: "Keintz, H. Mark" <mkeintz@wharton.upenn.edu>
In-Reply-To:  <FEE685C811A7E44AAD17E47B2A966E29BAD35E2A05@KITE.wharton.upenn.edu>
Content-Type: text/plain; charset=ISO-8859-1

Mark,

The use of double iterator for a single hash object with clear() method is very great.

Thanks.

Muthia Kachirayan

On Tue, Sep 29, 2009 at 5:35 PM, Keintz, H. Mark <mkeintz@wharton.upenn.edu>wrote:

> > Here's a solution that accumulates all the desired company counts customer > by customer. For each CID it populates a small hash table of AMT by > company. At the end of each CID it adds the counts to a larger (but still > very small) hash table of totals amounts. > > Note the use of 2 hash iterators applied to the same hash object. > > Regards, > Mark > > > > data _null_; > /* Total AMT for each COMPANY, for customers who also bought from COMPANY2 > */ > dcl hash totals (ordered:'a'); > totals.definekey('company2','company'); > totals.definedata('company2','company','total_amt'); > totals.definedone(); > > /** AMT, by company, for the current CID **/ > dcl hash cid_amt(); > cid_amt.definekey('company'); > cid_amt.definedata('company','amt'); > cid_amt.definedone(); > dcl hiter ci1('cid_amt'); > dcl hiter ci2('cid_amt'); > > do until (end_of_data); > ** For each CID, retrieve all AMTs, by company, into CID_AMT **; > do until (last.cid); > set have end=end_of_data; > by cid; > cid_amt.add(); > end; > > do i=ci2.first() by 0 while (i=0); ** For each company in CID_AMT ... > **; > company2=company; ** Set it as company2 **; > do j=ci1.first()by 0 while (j=0); ** For this COMPANY2, pass through > all COMPANYs **; > total_amt=0; > rt=totals.find(); ** Get current TOTAL_AMT > **; > total_amt=sum(total_amt,amt); ** Update TOTAL_AMT **; > rt=totals.replace(); ** .. and put it back into > TOTALS **; > j=ci1.next(); ** next company **; > end; > i=ci2.next(); ** next company2 **; > end; > cid_amt.clear(); > end; ** until end_of_data **; > > totals.output(dataset:'TEMP'); > run; > > > > > -----Original Message----- > > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > > Muthia Kachirayan > > Sent: Monday, September 28, 2009 11:58 PM > > To: SAS-L@LISTSERV.UGA.EDU > > Subject: Re: Cross-Summing problem > > > > For those who happen to read this thread may find it easy to follow the > > sequence in one place. > > > > Paul Phan posted the following on Thu, Sep 24, 2009 at 4:49 AM: > > > > HAVE: > > > > Row(R) CID COMPANY Amt > > 1 111 A 4 > > 2 111 D 2 > > 3 111 G 3 > > 4 111 X 4 > > 5 112 A 6 > > 6 112 D 2 > > 7 112 G 8 > > 8 113 G 4 > > 9 113 X 2 > > 10 114 A 5 > > 11 115 X 5 > > > > WANT: > > A D G X > > A 15 4 11 4 > > D 10 4 11 4 > > G 10 4 15 6 > > X 4 2 7 11 > > > > Ya Haung responded with the Proc SQL solution: > > > > proc sql; > > create table xx1 as > > select distinct a.company as row,b.company as col,sum(b.amt) as tot > > from xx a, xx b > > where a.cid=b.cid > > group by a.company,b.company > > order by a.company,b.company > > ; > > proc transpose data=xx1 out=xx2 (drop=_name_); > > by row; > > id col; > > var tot; > > run; > > > > Paul Dorfman improved the speed of the Proc SQL with code: > > > > proc sql ; > > create table pre_sum as > > select company > > , cid > > , sum (amt) as pre_sum > > , count (amt) as pre_cnt > > from test > > group company > > , cid > > ; > > create table totals as > > select a.company as row > > , b.company as col > > , sum (b.pre_sum) as sum > > , count (b.pre_cnt) as cnt > > from pre_sum a, pre_sum b > > where a.cid = b.cid > > group a.company > > , b.company > > order a.company > > , b.company > > ; > > quit ; > > proc transpose data = totals out=sum_tran (drop=_name_) ; > > by row ; > > id col ; > > var sum ; > > run ; > > proc transpose data = totals out=cnt_tran (drop=_name_) ; > > by row ; > > id col ; > > var cnt ; > > run ; > > > > Wei Wang posted a hash solution with : > > > > *assuming data is not aggregated at the customer and company level; > > data pre_sum_hash (rename = (company = row) drop = amt) > > pre_sum_hash2 (rename = (company = col) drop = amt); > > ********************************************************** > > Define variables > > **********************************************************; > > if 0 then set xx; > > ********************************************************** > > Define Hash > > **********************************************************; > > dcl hash h(hashexp:4, ordered: 'a'); > > h.defineKey('CID', 'COMPANY'); > > h.defineData('CID', 'COMPANY', 'presum', 'precnt'); > > h.defineDone(); > > ********************************************************** > > Calculate Sum and count > > **********************************************************; > > do while(not done); > > set xx end=done; > > by cid company; > > if h.find() ne 0 then do; > > presum = 0; > > precnt = 0; > > end; > > presum + amt; > > precnt + 1; > > if last.company then output; > > end; > > ********************************************************** > > Output > > **********************************************************; > > stop; > > run; > > > > *get the cartesian join in data step; > > *it might be possible to conbine the next two steps into one; > > *for readability, I keep them seperated; > > data _null_; > > if 0 then set pre_sum_hash2; > > ********************************************************** > > Define Hashes > > **********************************************************; > > dcl hash h (dataset: 'pre_sum_hash2', hashexp:4, ordered: 'Y'); > > h.defineKey('CID', 'col'); > > h.defineDone(); > > dcl hash hout (ordered: 'a'); > > hout.defineKey('row', 'col', 'n'); > > hout.definedata('row', 'col', 'presum', 'precnt'); > > hout.defineDone(); > > do while(not done); > > set pre_sum_hash end=done; > > HoldCID = CID; > > declare hiter ih("h"); > > rc=ih.first(); > > do while(rc=0); > > If CID=HoldCID then do; > > hout.add(); > > n+1; > > end; > > rc=ih.next(); > > end; > > end; > > hout.output(dataset : 'CartesianJoin'); > > stop; > > drop holdcid rc; > > run; > > > > data hash_totals; > > set CartesianJoin; > > by row col; > > if first.col then do; > > sum = 0; > > cnt = 0; > > end; > > sum + presum; > > cnt + precnt; > > if last.col then output; > > drop presum precnt; > > run; > > > > Paul Phan replied with: > > > > "Thank you Paul and Wei for your help. I was able to run Paul code but > > the > > data hash solution from Wei generate this error message > > ERROR: Hash object added 2228208 items when memory failure occurred. > > FATAL: Insufficient memory to execute data step program. Aborted during > > the EXECUTION phase." > > > > This shows that the hash object could not find sufficient memory to > > hold > > data. > > > > Another version of hash is given below which will provide an output > > similar > > to the PROC SQL code given by Paul. Further, using Proc Tanspose the > > output > > table can be processed to give square matrices of SUMS as well as > > COUNTS or > > using other datastep. > > > > Paul Phan may check this code with the original data. My guess is that > > when > > Paul's Proc SQL has done the job, this will do the job in a slightly > > lesser > > time. > > > > > > > > data test (drop = Row); > > input Row CID COMPANY $ Amt ; > > cards ; > > 1 111 A 4 > > 2 111 D 2 > > 3 111 G 3 > > 4 111 X 4 > > 5 112 A 6 > > 6 112 D 2 > > 7 112 G 8 > > 8 113 G 4 > > 9 113 X 2 > > 10 114 A 5 > > 11 115 X 5 > > ; > > run ; > > > > The following is oneway to get the unique number of Companies. One may > > use > > datastep or Proc sql for this. > > > > data _null_; > > if _n_ = 1 then do; > > declare hash h(); > > h.definekey('company'); > > h.definedone(); > > end; > > do until(eof); > > set test (keep = company) end = eof; > > if h.find() ne 0 then h.add(); > > end; > > call symputx('num_companies',h.num_items); > > run; > > %put &num_companies; > > > > The fullstimer option may be chosen to see the memory usage. > > > > The dataset, TEST, must be presorted by CID. Within each CID, Company > > and > > Amt are stored into temporary arrays, TR[ ] and TAMT[ ]. Array TR[ ] is > > used > > to make the cartesian products and simultaneously TOT_AMT and COUNT are > > accumulated and saved into the hash table. Note the use of NROWS which > > helps > > to know how many elements are used for the current CID to place into > > the > > array, TR[ ] saving us from reinitializing it for every CID. > > > > > > *options fullstimer; > > data _null_; > > if _n_ = 1 then do; > > if 0 then set test; > > declare hash h(ordered:'a'); > > h.definekey('row','col'); > > h.definedata('row','col','tot_amt','count'); > > h.definedone(); > > end; > > do until(eof); > > nrows = 0; > > do until(last.CID); > > set test (keep = cid company amt rename = (company = row)) end > > = eof > > ; > > by CID; > > array tr[&num_companies.]$ _temporary_; > > array tamt[&num_companies.] _temporary_; > > nrows + 1; > > tr[nrows] = row; > > tamt[nrows] = amt; > > end; > > > > do j = 1 to nrows; > > row = tr[j]; > > do k = 1 to nrows; > > col = tr[k]; > > if h.find() ne 0 then do; > > tot_amt = 0; > > count = 0; > > end; > > tot_amt + tamt[k]; > > count + 1; > > h.replace(); > > end; > > end; > > end; > > h.output(dataset:'want'); > > stop; > > run; > > > > Muthia Kachirayan >


Back to: Top of message | Previous page | Main SAS-L page