Date: Mon, 2 Feb 1998 18:50:42 0500
ReplyTo: Undetermined origin c/o LISTSERV administrator
<ownerLISTSERV@UGA.CC.UGA.EDU>
Sender: "SAS(r) Discussion" <SASL@UGA.CC.UGA.EDU>
From: Undetermined origin c/o LISTSERV administrator
<ownerLISTSERV@UGA.CC.UGA.EDU>
Sorting and indexing don't help when you have to transverse the
entire data set. Right? So, you shouldn't be all that surprised at
your results. To get the minimum and maximum values from a sorted data
set, would you be allowed to do the following:
/* Assumes FOO is sorted by x */
data stats( keep = x_mn x_mx );
set foo nobs = lastrec;
x_mn = x;
set foo point = lastrec;
x_mx = x;
output;
stop;
run;

Jack Shoemaker / Oxford Specialty Management / JShoemak@oxhp.com
>
>From: Karsten M. Self[SMTP:kmself@IX.NETCOM.COM]
>Sent: Thursday, January 29, 1998 3:32 AM
>To: SASL@UGA.CC.UGA.EDU
>Subject: A lament on SQL nonoptimization in MIN, MAX query
>
>This is more a lament than a request for suggestions, though any
>improvements or explanations would be appreciated (birdies, little or
>otherwise, included).
>
>I've inherited a program which is particularly extravagant in its use of
>I/O. Barring reengineering (a cause I champion), I'm stuck with it.
> Among the steps it does is to get the maximum and minimum values of a
>variable, repeatedly, from midlin' to largish datasets, so this stats
>generation costs 615 minutes of runtime. Despite my frustrations, I wish
>the original programmer a full recovery from his fractured vertebra.
>
>(Note that I'm not looking for alternate methods of getting these
>statistics  the obvious thing to do is calculate the values once and
>reuse them. As stated above, this is the identical calculation repeated
>many times.)
>
>
>I looked into replacing the PROC MEANS with SQL, and found that this was
>less efficient (first surprise). I then discovered that neither indexing
>nor sorting the input dataset by the analysis variable has any impact on
>either MEANS or SQL performance in selecting max and min values. What a
>crying shame!
>
>Following code illustrates example running under NT. Similar results are
>observed with Sun Solaris. Note that runtime is 1.5  2 times slower for
>SQL over MEANS.
>
>
>263 %inc pgm /source2;
>NOTE: %INCLUDE (level 1) file PGM is file
>h:\Karsten\Computer\SAS\SasLMisc\sql_index.sas.
>264 +/* Why does SQL not optimize this query using index or
>265 + * sort sequence?
>266 + */
>267 +
>268 +
>269 +data foo;
>270 + array y{10};
>271 + do i= 1 to 100000;
>272 + x= ranuni(65532);
>273 + do iY= lbound(y) to hbound(y);
>274 + y{iy} = ranuni(65532);
>275 + end;
>276 + output;
>277 + end;
>278 + run;
>
>NOTE: The data set WORK.FOO has 100000 observations and 13 variables.
>NOTE: The DATA statement used 12.49 seconds.
>
>
>279 +
>280 +
>281 +/* Get min and max */
>282 +title "Unsorted, unindexed";
>283 +proc means data= foo min max;
>284 + var x;
>285 + run;
>
>NOTE: The PROCEDURE MEANS used 0.53 seconds.
>
>
>286 +
>287 +proc sql;
>288 + select min(x), max(x) from foo;
>289 + quit;
>NOTE: The PROCEDURE SQL used 1.02 seconds.
>
>
>290 +
>291 +
>292 +
>293 +/* Index */
>294 +proc sql;
>295 + create index x on foo(x);
>NOTE: Simple index X has been defined.
>296 + quit;
>NOTE: The PROCEDURE SQL used 3.23 seconds.
>
>
>297 +
>298 +
>299 +/* Get min and max */
>300 +title "Unsorted, indexed";
>301 +proc means data= foo min max;
>302 + var x;
>303 + run;
>
>NOTE: The PROCEDURE MEANS used 0.74 seconds.
>
>
>304 +
>305 +proc sql;
>306 + select min(x), max(x) from foo;
>307 + quit;
>NOTE: The PROCEDURE SQL used 1.22 seconds.
>
>
>308 +
>309 +
>310 +
>311 +/* Sort */
>312 +proc sort data= foo force;
>313 + by x;
>314 + run;
>
>NOTE: SAS sort was used.
>NOTE: The data set WORK.FOO has 100000 observations and 13 variables.
>NOTE: The PROCEDURE SORT used 12.25 seconds.
>
>
>315 +
>316 +
>317 +
>318 +/* Get min and max */
>319 +title "Sorted, unindexed";
>320 +proc means data= foo min max;
>321 + var x;
>322 + run;
>
>NOTE: The PROCEDURE MEANS used 0.49 seconds.
>
>
>323 +
>324 +proc sql;
>325 + select min(x), max(x) from foo;
>326 + quit;
>NOTE: The PROCEDURE SQL used 1.08 seconds.
>
>
>327 +
>328 +
>NOTE: %INCLUDE (level 1) ending.
>
>Karsten M. Self (kmself@ix.netcom.com)
>
> What part of "gestalt" don't you understand?
> (Welchen Teil von "Gestalt" verstehen Sie nicht?)
>
