Date: Tue, 1 Apr 1997 12:44:12 -0500
Reply-To: "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Subject: FW: FW: why did this code take 22 hours to run?
Sometimes you can't win for losing. My most sincere apologies to Laurie;
Ian suggests that next time I go for age as well but I don't think I
could pull that off successfully.
Ian, I'm very surprised that the KEEP= option doesn't work on the input
file to a SORT. I have always thought that dataset options like KEEP,
DROP, RENAME, etc work the same way in DATA steps and PROCs. Can you
clarify the situation a bit more, please. Is this peculiar to PROC SORT
or the KEEP option? Are other procs similarly affected?
Clint Rickards, ARS IT, TNB2
>Sent: Tue, Apr 01, 1997 10:39 am
>To: Rickards, Clinton S
>Subject: Re: FW: why did this code take 22 hours to run?
> You not only changed Laurie to Linda. You also changed *his* sex. The
> next thing is to go after his age.
> In fact the use of a view might help. The KEEP option on the input
> data set to a sort is not actually implemented until the output.
> (Don't ask me why? I find it most annoying.) Hence the view makes the
> keeping take place before the sort and will save time when the file is
> large enough and there are enough variables. I don't know whether 24
> to 1 is a big enough ratio or not.
> Ian Whitlock <firstname.lastname@example.org>
>______________________________ Reply Separator
>Subject: FW: why did this code take 22 hours to run?
>Author: "Rickards, Clinton S" <RickardsCS@AETNA.COM> at internet-e-mail
>Date: 4/1/97 11:29 AM
>Linda Fleming postulated that a view and sort would reduce the time to
>find the keys that Tod Mijanovich was looking for. Her solution is
>certain correct but there really is no need for a view; just sorting
>directly will do the trick:
> proc sort data=ercpiv (keep=cpi_key) /* <== keep only the key!! */
> nodupkey noequals;
> by cpi_key;
>Clint Rickards, ARS IT, TNB2
>>Sent: Fri, Mar 21, 1997 1:52 pm
>>Subject: Re: why did this code take 22 hours to run?
>>> Does anyone have an inkling about why the second data step below took
>>> over 22 hours to run? The first data step subsets a 4.5 million
>>> record dataset down to 3.4 million records, and builds an index, in 6
>>> minutes. The second data step uses the index to further subset the
>>> data down to 1.3 million records, but took 22 hours! This was run on
>>> a single-user Pentium Pro 200 with SAS 6.12 for OS/2, Warp 4, and many
>>> gigs of free disk space in a RAID-0 array and 128mb of RAM. The
>>> machine was not used for anything else during the run time.
>>> I regularly run involved data steps on 20 million-record datasets and
>>> a run has never taken longer than about 8 hours. Maybe the following
>>> is a factor: my bufsize is 64K and bufno is 32. (But I set them this
>>> way because these were the winning numbers in benchmark testing.)
>>> Any ideas? Thanks.
>>> Tod Mijanovich
>>> 1722 data hhc.er(INDEX=(CPI_KEY));
>>> 1723 set hhc.iper2(where=(clngrp ne 'HO'));
>>> 1724 RUN;
>>> NOTE: The data set HHC.ER has 3400284 observations and 24 variables.
>>> NOTE: The DATA statement used 6 minutes 23.19 seconds.
>>> 1726 DATA HHC.ERCPI;
>>> 1727 SET HHC.ER;
>>> 1728 BY CPI_KEY;
>>> 1729 IF FIRST.CPI_KEY;
>>> 1730 KEEP CPI_KEY;
>>> 1731 RUN;
>>> NOTE: The data set HHC.ERCPI has 1345222 observations and 1 variables.
>>> NOTE: The DATA statement used 22 hours 21 minutes 15.77 seconds.
>>Forcing hits on the index on *every* single observation means that you're
>>be making at least 6.8 Million reads. Since you're only keeping one
>>sorting by that variable out to ERCPI, with nodupkey, will accomplish the
>>thing and be more efficient. Try:
>> Create view of dataset, with only one variable.
>>data ercpiv / view=ercpiv;
>> Sort view out to hhc.ercpi, keeping only unique values of key.
>>proc sort data=ercpiv out=hhc.ercpi nodupkey noequals;
>> Tidy up.
>>proc datasets lib=work nolist nowarn;
>>This streams the values of cpi_key into the sort, and keeps unique values. I
>>guarantee (a bottle of Steinlager?) that it will run faster. And it will
>>heaps of space as well. Using a view is faster and cheaper on space than
>>a temporary dataset - counter-intuitive, but it does work.
>>Bufno only makes minimal difference, and bufsize even less (in my
>>Preferably put bufsize to zero, and let SAS's default take over. It's almost
>>always the one of the least important factors.
>>Laurie Fleming | On a clear disk, | (+64 4) 479-1589
>>4 Kenya St | you can seek forever. | (+64 21) 688-140
>>Ngaio | | email@example.com
>>Wellington 6004 | | firstname.lastname@example.org
>>New Zealand | |