Date: Mon, 22 Dec 2008 14:29:00 -0600
Reply-To: Gregg Snell <sas-l@DATASAVANTCONSULTING.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Gregg Snell <sas-l@DATASAVANTCONSULTING.COM>
Subject: Re: More efficient way to sort/subset a large dataset
Content-Type: text/plain; charset=iso-8859-1
Ya,
Here is a hashing solution that took just over 4 seconds to create
work.yy from your example. You may run into memory issues when the
product of (xx vars)*(xx var lengths)*(unique subset id's) exceeds your
system limits, but your example of 53*8*200 was certainly within limits.
data xx0/view=xx0;
retain id_count 0;
set xx(obs=0);
run;
data _null_;
length id id_count 8;
retain id_count 1;
declare hash h_sortsub();
rc=h_sortsub.DefineKey('id');
rc=h_sortsub.DefineData('id_count');
rc=h_sortsub.DefineDone();
do while (not eof1);
set subset(keep=id) end=eof1;
rc=h_sortsub.add();
end;
declare hash h_yy(dataset:'xx0');
rc=h_yy.DefineKey('id','id_count');
rc=h_yy.DefineDone();
do while (not eof2);
set xx end=eof2;
if h_sortsub.find()=0 then do;
h_yy.add();
id_count+1;
h_sortsub.replace();
end;
end;
h_yy.output(dataset:'yy');
stop;
run;
Merry Christmas!
Gregg Snell
> Hi there,
>
> Assuming I have a non-sorted large dataset, I want to sort it
> by an id var. In the meantime subset it to certain ids. What's the
> better way than using where statement in the dataset option for
> the subsetting?
>
> I can't use view. I don't think sort the whole data than merge
> with a subset data would do any better.
>
> Another issue is that the id list (a macro var, see below sample)
could
> be vary long. When id is string, the macro var may cause overflow
> due to memory limit.
>
> Thanks
>
> Ya
>
> My current code is like below:
>
> libname c "c:\temp";
>
> /* create testing data
> data c.xx;
> array vv(50) v1-v50;
> do id=1 to 1000;
> do i=1 to int(ranuni(1)*100);
> do j=1 to 50;
> vv(j)=ranuni(2);
> output;
> end;
> end;
> end;
> run;
> */
>
> * create id list;
> data subset;
> do i=1 to 200;
> id=int(ranuni(2)*1000);
> output;
> end;
> run;
>
> proc sql noprint;
> select distinct id into :idlst separated by ','
> from subset
> ;
>
> %put &idlst;
>
> * sort and subsetting;
> proc sort data=c.xx (where=(id in (&idlst)))
> out=yy;
> by id;
> run;
>
>
--
NeoMail - http://neomail.sourceforge.net
|