LISTSERV at the University of Georgia
Date:         Tue, 1 Apr 1997 12:44:12 -0500
Reply-To:     "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         "Rickards, Clinton S" <RickardsCS@AETNA.COM>
Subject:      FW: FW: why did this code take 22 hours to run?

Sometimes you can't win for losing. My most sincere apologies to Laurie; Ian suggests that next time I go for age as well but I don't think I could pull that off successfully.

Ian, I'm very surprised that the KEEP= option doesn't take effect on the input file to a SORT. I have always thought that dataset options like KEEP, DROP, RENAME, etc. work the same way in DATA steps and PROCs. Can you clarify the situation a bit more, please? Is this peculiar to PROC SORT or to the KEEP option? Are other procs similarly affected?
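A rough way to check this would be to run both variants and compare the times reported in the log. This is only a sketch: it assumes the HHC libref and the HHC.ER data set from Tod's log are available, and the WORK names ERCPI_A, ERCPI_B, and ERCPIV are made up for illustration.

```sas
/* Sketch only: assumes libref HHC and data set HHC.ER exist.            */
/* ERCPI_A, ERCPI_B and ERCPIV are hypothetical names for this test.     */

/* Variant A: KEEP= data set option directly on the PROC SORT input */
proc sort data=hhc.er(keep=cpi_key) out=work.ercpi_a nodupkey;
  by cpi_key;
run;

/* Variant B: a DATA step view that passes only CPI_KEY to the sort */
data work.ercpiv / view=work.ercpiv;
  set hhc.er(keep=cpi_key);
run;

proc sort data=work.ercpiv out=work.ercpi_b nodupkey;
  by cpi_key;
run;

/* Both outputs should contain the same unique keys */
proc compare base=work.ercpi_a compare=work.ercpi_b;
run;

/* Remove the temporary view */
proc datasets lib=work nolist;
  delete ercpiv / memtype=view;
quit;
```

If Ian's description holds, the PROC SORT step in variant B should report less time than the one in variant A on a wide, large file; on a small file the difference may not be visible.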

Thanks...
Clint

Clint Rickards, ARS IT, TNB2
Phone: 860-273-3420

>----------
>From: whitloi1@westatpo.westat.com[SMTP:whitloi1@westatpo.westat.com]
>Sent: Tue, Apr 01, 1997 10:39 am
>To: Rickards, Clinton S
>Subject: Re: FW: why did this code take 22 hours to run?
>
> Clinton,
>
> You not only changed Laurie to Linda. You also changed *his* sex. The
> next thing is to go after his age.
>
> In fact the use of a view might help. The KEEP option on the input
> data set to a sort is not actually implemented until the output.
> (Don't ask me why? I find it most annoying.) Hence the view makes the
> keeping take place before the sort and will save time when the file is
> large enough and there are enough variables. I don't know whether 24
> to 1 is a big enough ratio or not.
>
> Ian Whitlock <whitloi@westat.com>
>
>
>______________________________ Reply Separator
>_________________________________
>Subject: FW: why did this code take 22 hours to run?
>Author: "Rickards, Clinton S" <RickardsCS@AETNA.COM> at internet-e-mail
>Date: 4/1/97 11:29 AM
>
>
>Linda Fleming postulated that a view and sort would reduce the time to
>find the keys that Tod Mijanovich was looking for. Her solution is
>certain correct but there really is no need for a view; just sorting
>directly will do the trick:
>
>    proc sort data=ercpiv (keep=cpi_key) /* <== keep only the key!! */
>              out=hhc.ercpi
>              nodupkey noequals;
>      by cpi_key;
>    run;
>
>Clint
>Clint Rickards, ARS IT, TNB2
>Phone: 860-273-3420
>
>>----------
>>From: lfleming@actrix.gen.nz[SMTP:lfleming@actrix.gen.nz]
>>Sent: Fri, Mar 21, 1997 1:52 pm
>>To: SAS-L@VTVM1.CC.VT.EDU
>>Subject: Re: why did this code take 22 hours to run?
>>
>>> Does anyone have an inkling about why the second data step below took
>>> over 22 hours to run? The first data step subsets a 4.5 million
>>> record dataset down to 3.4 million records, and builds an index, in 6
>>> minutes. The second data step uses the index to further subset the
>>> data down to 1.3 million records, but took 22 hours! This was run on
>>> a single-user Pentium Pro 200 with SAS 6.12 for OS/2, Warp 4, and many
>>> gigs of free disk space in a RAID-0 array and 128mb of RAM. The
>>> machine was not used for anything else during the run time.
>>>
>>> I regularly run involved data steps on 20 million-record datasets and
>>> a run has never taken longer than about 8 hours. Maybe the following
>>> is a factor: my bufsize is 64K and bufno is 32. (But I set them this
>>> way because these were the winning numbers in benchmark testing.)
>>>
>>> Any ideas? Thanks.
>>>
>>> Tod Mijanovich
>>>
>>>
>>> 1722 data hhc.er(INDEX=(CPI_KEY));
>>> 1723 set hhc.iper2(where=(clngrp ne 'HO'));
>>> 1724 RUN;
>>>
>>> NOTE: The data set HHC.ER has 3400284 observations and 24 variables.
>>> NOTE: The DATA statement used 6 minutes 23.19 seconds.
>>>
>>> 1725
>>> 1726 DATA HHC.ERCPI;
>>> 1727 SET HHC.ER;
>>> 1728 BY CPI_KEY;
>>> 1729 IF FIRST.CPI_KEY;
>>> 1730 KEEP CPI_KEY;
>>> 1731 RUN;
>>>
>>> NOTE: The data set HHC.ERCPI has 1345222 observations and 1 variables.
>>> NOTE: The DATA statement used 22 hours 21 minutes 15.77 seconds.
>>> .
>>
>>Forcing hits on the index on *every* single observation means that you're
>>going to be making at least 6.8 Million reads. Since you're only keeping
>>one variable, sorting by that variable out to ERCPI, with nodupkey, will
>>accomplish the same thing and be more efficient. Try:
>>
>>/*
>>  Create view of dataset, with only one variable.
>>*/
>>data ercpiv / view=ercpiv;
>>set hhc.er(keep=cpi_key);
>>
>>/*
>>  Sort view out to hhc.ercpi, keeping only unique values of key.
>>*/
>>proc sort data=ercpiv out=hhc.ercpi nodupkey noequals;
>>by cpi_key;
>>
>>/*
>>  Tidy up.
>>*/
>>proc datasets lib=work nolist nowarn;
>>delete ercpiv;
>>quit;
>>
>>This streams the values of cpi_key into the sort, and keeps unique values. I
>>guarantee (a bottle of Steinlager?) that it will run faster. And it will
>>save heaps of space as well. Using a view is faster and cheaper on space
>>than creating a temporary dataset - counter-intuitive, but it does work.
>>
>>Bufno only makes minimal difference, and bufsize even less (in my
>>experience). Preferably put bufsize to zero, and let SAS's default take
>>over. It's almost always the one of the least important factors.
>>
>>Laurie Fleming  | On a clear disk,      | (+64 4) 479-1589
>>4 Kenya St      | you can seek forever. | (+64 21) 688-140
>>Ngaio           |                       | lfleming@actrix.gen.nz
>>Wellington 6004 |                       | flemingl@ho.acc.org.nz
>>New Zealand     |                       |
>>
>

