Date: Tue, 29 Jul 2003 08:52:52 -0700
Reply-To: "Huang, Ya" <yhuang@AMYLIN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Huang, Ya" <yhuang@AMYLIN.COM>
Subject: Re: Count distinct in sas
Content-Type: text/plain; charset="iso-8859-1"
Nembra,
Here is what I just tested:
1 data xx;
2 do product=1 to int(ranuni(123)*1000);
3 do hhd_id=1 to int(ranuni(345)*1000);
4 do dummy=1 to int(ranuni(456)*10);
5 output;
6 end;
7 end;
8 end;
9 run;
NOTE: The data set WORK.XX has 1671518 observations and 3 variables.
NOTE: DATA statement used:
real time 2.28 seconds
cpu time 0.58 seconds
10
11 proc sql;
12 create table t1 as
13 select product,count(distinct hhd_id) as cnt
14 from xx
15 group by product
16 ;
NOTE: Table WORK.T1 created, with 750 rows and 2 columns.
17
NOTE: PROCEDURE SQL used:
real time 13.57 seconds
cpu time 12.50 seconds
18 proc sql;
19 create table t2 as
20 select distinct product,hhd_id
21 from xx
22 ;
NOTE: Table WORK.T2 created, with 334291 rows and 2 columns.
23 create table t2 as
24 select distinct product,count(*) as cnt
25 from t2
26 group by product
27 ;
NOTE: Table WORK.T2 created, with 750 rows and 2 columns.
28 quit;
NOTE: PROCEDURE SQL used:
real time 3.91 seconds
cpu time 3.53 seconds
--------------------------------
Note, the first sql step is in the same format as yours,
which took 13.57 seconds to run. The second sql step
has two pass, but it took only 3.91 seconds to run. Both of them
got the same result. It seems to me that you may save
big time by using the second format. Hopefully,
3.91/13.57*9hours=2.59hours !!
Kind regards,
Ya Huang
-----Original Message-----
From: Nembra [mailto:jesus.garcia@DUNNHUMBY.COM]
Sent: Tuesday, July 29, 2003 7:45 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Count distinct in sas
Hi,
I am working with datasets that have around 400 millions of
transactions (grocery ones) and I need to aggregate households.
I have used proc sql and it takes 9 hours.
Then, I used sort + first and it took 4 hours.
Do you know a proc that count distinct values doing a group by?
My steps have been:
1. proc sql;
create table sol
as
select product,
count(distinct hhd_id)
from initial
group by product;
quit;
2. proc sort data=initial;
by product hhd_id;
run;
data sol;
retain cust 0;
set initial;
by product hhd_id;
if first.product then cust=0;
if first.hhd_id then cust +1;
if last.hhd_id then output;
run;
Cheers.
J.