Date: Tue, 23 Aug 2011 01:29:53 -0400
Reply-To: Søren Lassen <s.lassen@POST.TELE.DK>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Søren Lassen <s.lassen@POST.TELE.DK>
Subject: Re: Many to many merge using Hash
Content-Type: text/plain; charset=ISO-8859-1
Hari,
You are basically doing a full join of the ALL table with itself.
In your case, I do not think that is necessary:
proc sort data=all out=purch(keep=id p_dt rename=(p_dt=dt));
by id p_dt;
run;
proc sort data=all out=disp(keep=id d_dt cnt rename=(d_dt=dt));
by id d_dt;
run;
data want;
set purch(in=p) disp(in=d);
by id dt;
if first.id then var1=0;
var1+p;
if d;
keep id cnt var1;
run;
proc sort;
by id cnt;
run;
In the example here, a purchase will be counted as before a disposal
on the same date. If you do not want that, you should change the line
set purch(in=p) disp(in=d);
to
set disp(in=d) purch(in=p);
Regards,
Søren
On Mon, 22 Aug 2011 17:04:57 -0400, Hari Nath <hari_s_nath@YAHOO.COM> wrote:
>Hi all,
> Following is my sample data and code. All am trying to find is the number
>of purchases(p_dt) prior to disposals(d_dt). My actual 'all' dataset has
>around 20 million records. The merge am trying to do here will go against
20
>million by 20 million and hence Hash is being used here. I had run this
code
>before and it ran for few hours which it never finished past the 'step
2'(so
>i had to kill the job).
> What i want to do is to combine 'step 3' in to 'step 2', which am not
able
>to accomplish as am a beginner with hashing. So this will cut down the time
>taken by 'step 2', once the 'step 3' is integrated with it.
> Please let me know with any suggestion/ ideas.
>
>Many Thanks.
>-Hari
>
>data all ;
> input id p_dt :date9. d_dt :date9. cnt $ ;
> format p_dt d_dt date9. ;
>cards ;
>101 '01jan2006'd '30dec2007'd 1
>101 '15feb2006'd '30jun2007'd 2
>101 '01jan2007'd '31dec2009'd 3
>101 '01jan2007'd '30jun2009'd 4
>101 '01jan2010'd '15aug2011'd 5
>101 '01jan2011'd '15aug2011'd 6
>101 '25feb2011'd '15aug2011'd 7
>102 '25mar2011'd '15aug2011'd 1
>102 '25apr2011'd '15aug2011'd 2
>;
>run ;
>
>/* STEP 1 */
>data purch ;
> set all ;
> b_cnt = cnt ;
> keep id p_dt b_cnt ;
>run ;
>data disp ;
> set all ;
> keep id d_dt cnt ;
>run ;
>
>/* STEP 2 */
>data datasetC ;
> declare hash hh(hashexp: 10) ;
> hh.definekey('id','_n_') ;
> hh.definedata('p_dt','b_cnt') ;
> hh.definedone() ;
>
> declare hash hs() ;
> hs.definekey('id') ;
> hs.definedata('_n_') ;
> hs.definedone() ;
>
> do until (eof1) ;
> set purch end = eof1 ;
> _n_ ++ 1 ;
> hs.replace() ;
> hh.add() ;
> end ;
>
> do until (eof2) ;
> set disp end = eof2 ;
> if hs.find() = 0 then do ;
> do _n_ = 1 to _n_ ;
> if hh.find() = 0 and p_dt < d_dt then do;
>
> output ;
> keep id b_cnt cnt ;
> end ;
> end ;
> end ;
> end ;
>stop ;
>run ;
>
>/* STEP 3 */
>proc sql noprint ;
> Create table datasetD as
> select id, cnt , count(*) as var1
> from datasetC
> group by id, cnt
> order by id, cnt
> ;
>quit ;
|