Date: Tue, 1 Apr 1997 12:44:12 -0500
Reply-To: "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Subject: FW: FW: why did this code take 22 hours to run?
Sometimes you can't win for losing. My most sincere apologies to Laurie;
Ian suggests that next time I go for age as well but I don't think I
could pull that off successfully.
Ian, I'm very surprised that the KEEP= option doesn't work on the input
file to a SORT. I have always thought that dataset options like KEEP,
DROP, RENAME, etc. work the same way in DATA steps and PROCs. Can you
clarify the situation a bit more, please? Is this peculiar to PROC SORT
or to the KEEP option? Are other procs similarly affected?
Thanks...
Clint
Clint Rickards, ARS IT, TNB2
Phone: 860-273-3420
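
For reference, a minimal sketch of the two variants in question, using the
data set and variable names from the thread (hhc.er, hhc.ercpi, cpi_key).
The view name work.ercpiv is only illustrative. The sketch shows where
KEEP= appears in each variant; it does not settle when PROC SORT actually
applies it, which is the open question above.

```sas
/* Variant 1: KEEP= as a data set option on the PROC SORT input. */
proc sort data=hhc.er (keep=cpi_key)
          out=hhc.ercpi
          nodupkey noequals;
  by cpi_key;
run;

/* Variant 2: KEEP= applied inside a DATA step view, which is then sorted. */
/* work.ercpiv is an illustrative name for the view.                      */
data work.ercpiv / view=work.ercpiv;
  set hhc.er (keep=cpi_key);
run;

proc sort data=work.ercpiv
          out=hhc.ercpi
          nodupkey noequals;
  by cpi_key;
run;
```

Both variants write the same output data set, so run one or the other and
compare the times reported in the log.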
>----------
>From: whitloi1@westatpo.westat.com[SMTP:whitloi1@westatpo.westat.com]
>Sent: Tue, Apr 01, 1997 10:39 am
>To: Rickards, Clinton S
>Subject: Re: FW: why did this code take 22 hours to run?
>
> Clinton,
>
> You not only changed Laurie to Linda. You also changed *his* sex. The
> next thing is to go after his age.
>
> In fact the use of a view might help. The KEEP option on the input
> data set to a sort is not actually implemented until the output.
> (Don't ask me why; I find it most annoying.) Hence the view makes the
> keeping take place before the sort and will save time when the file is
> large enough and there are enough variables. I don't know whether 24
> to 1 is a big enough ratio or not.
>
> Ian Whitlock <whitloi@westat.com>
>
>
>______________________________ Reply Separator _________________________________
>Subject: FW: why did this code take 22 hours to run?
>Author: "Rickards, Clinton S" <RickardsCS@AETNA.COM> at internet-e-mail
>Date: 4/1/97 11:29 AM
>
>
>Linda Fleming postulated that a view and sort would reduce the time to
>find the keys that Tod Mijanovich was looking for. Her solution is
>certainly correct but there really is no need for a view; just sorting
>directly will do the trick:
>
> proc sort data=hhc.er (keep=cpi_key) /* <== keep only the key!! */
>           out=hhc.ercpi
>           nodupkey noequals;
>   by cpi_key;
> run;
>
>Clint
>Clint Rickards, ARS IT, TNB2
>Phone: 860-273-3420
>
>>----------
>>From: lfleming@actrix.gen.nz[SMTP:lfleming@actrix.gen.nz]
>>Sent: Fri, Mar 21, 1997 1:52 pm
>>To: SAS-L@VTVM1.CC.VT.EDU
>>Subject: Re: why did this code take 22 hours to run?
>>
>>> Does anyone have an inkling about why the second data step below took
>>> over 22 hours to run? The first data step subsets a 4.5 million
>>> record dataset down to 3.4 million records, and builds an index, in 6
>>> minutes. The second data step uses the index to further subset the
>>> data down to 1.3 million records, but took 22 hours! This was run on
>>> a single-user Pentium Pro 200 with SAS 6.12 for OS/2, Warp 4, and many
>>> gigs of free disk space in a RAID-0 array and 128mb of RAM. The
>>> machine was not used for anything else during the run time.
>>>
>>> I regularly run involved data steps on 20 million-record datasets and
>>> a run has never taken longer than about 8 hours. Maybe the following
>>> is a factor: my bufsize is 64K and bufno is 32. (But I set them this
>>> way because these were the winning numbers in benchmark testing.)
>>>
>>> Any ideas? Thanks.
>>>
>>> Tod Mijanovich
>>>
>>>
>>> 1722 data hhc.er(INDEX=(CPI_KEY));
>>> 1723 set hhc.iper2(where=(clngrp ne 'HO'));
>>> 1724 RUN;
>>>
>>> NOTE: The data set HHC.ER has 3400284 observations and 24 variables.
>>> NOTE: The DATA statement used 6 minutes 23.19 seconds.
>>>
>>> 1725
>>> 1726 DATA HHC.ERCPI;
>>> 1727 SET HHC.ER;
>>> 1728 BY CPI_KEY;
>>> 1729 IF FIRST.CPI_KEY;
>>> 1730 KEEP CPI_KEY;
>>> 1731 RUN;
>>>
>>> NOTE: The data set HHC.ERCPI has 1345222 observations and 1 variables.
>>> NOTE: The DATA statement used 22 hours 21 minutes 15.77 seconds.
>>> .
>>
>>Forcing hits on the index on *every* single observation means that you're
>>going to be making at least 6.8 million reads. Since you're only keeping
>>one variable, sorting by that variable out to ERCPI, with nodupkey, will
>>accomplish the same thing and be more efficient. Try:
>>
>>/*
>> Create view of dataset, with only one variable.
>>*/
>>data ercpiv / view=ercpiv;
>>  set hhc.er(keep=cpi_key);
>>run;
>>
>>/*
>> Sort view out to hhc.ercpi, keeping only unique values of key.
>>*/
>>proc sort data=ercpiv out=hhc.ercpi nodupkey noequals;
>>  by cpi_key;
>>run;
>>
>>/*
>> Tidy up. The view is a member of type VIEW, so name the member type.
>>*/
>>proc datasets lib=work nolist nowarn;
>>  delete ercpiv / memtype=view;
>>quit;
>>
>>This streams the values of cpi_key into the sort, and keeps unique values. I
>>guarantee (a bottle of Steinlager?) that it will run faster. And it will
>>save heaps of space as well. Using a view is faster and cheaper on space
>>than creating a temporary dataset - counter-intuitive, but it does work.
>>
>>Bufno only makes a minimal difference, and bufsize even less (in my
>>experience). Preferably set bufsize to zero, and let SAS's default take
>>over. It's almost always one of the least important factors.
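>>
>>A minimal sketch of that advice, assuming the subset is simply re-created
>>after the option is reset (the page size is fixed when a data set is
>>created, so existing data sets keep the one they were built with):
>>
>>```sas
>>/* BUFSIZE=0 asks SAS to choose its default page size for new data sets. */
>>options bufsize=0;
>>
>>/* Re-create the subset so that it picks up the default page size. */
>>data hhc.er (index=(cpi_key));
>>  set hhc.iper2 (where=(clngrp ne 'HO'));
>>run;
>>```
>>
>>Whether this changes the run time noticeably is worth checking against the
>>times in the log rather than assuming.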
>>
>>Laurie Fleming | On a clear disk, | (+64 4) 479-1589
>>4 Kenya St | you can seek forever. | (+64 21) 688-140
>>Ngaio | | lfleming@actrix.gen.nz
>>Wellington 6004 | | flemingl@ho.acc.org.nz
>>New Zealand | |
>>
>