Date: Tue, 4 Jan 2011 13:12:16 -0800
Reply-To: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Subject: Re: Is it sorted? Are there performance benefits?
In-Reply-To: <AANLkTin79h38p1acj+njT_SXBsfEm9p=-yLxvizBvdKL@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Sterling, Joe, et al.,
Have I missed something in this thread? It wouldn't be the first time. :-) But, in the log below, I see a "short-circuiting" of file reading when the SORTEDBY= option is specified.
113 data have;
114 do i=1 to 10000000;
115 do j=1 to 10;
116 output;
117 end;
118 end;
119 run;
NOTE: The data set WORK.HAVE has 100000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 6.78 seconds
cpu time 5.74 seconds
120
121 data want(where=(i=1)) ;
122 set have;
123 run;
NOTE: There were 100000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 9.20 seconds
cpu time 9.20 seconds
124 data want ;
125 set have(where=(i=1) sortedby=i);
126 run;
NOTE: There were 10 observations read from the data set WORK.HAVE.
WHERE i=1;
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 3.68 seconds
cpu time 3.66 seconds
127
128
129 data want(where=(i=1)) ;
130 set have;
131 by i;
132 run;
NOTE: There were 100000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 13.27 seconds
cpu time 13.27 seconds
133 data want ;
134 set have(where=(i=1) sortedby=i);
135 by i;
136 run;
NOTE: There were 10 observations read from the data set WORK.HAVE.
WHERE i=1;
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 3.74 seconds
cpu time 3.74 seconds
So what am I missing in this discussion?
Dan
Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Joe Matise
> Sent: Tuesday, January 04, 2011 12:43 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Is it sorted? Are there performance benefits?
>
> Do you need the index, or can you just remove it (in PROC DATASETS)?
>
> -Joe
>
> On Tue, Jan 4, 2011 at 2:24 PM, Sterling Paramore
> <gnilrets@gmail.com>wrote:
>
> > Good point Mark, I neglected to test that. So it looks like we're
> stuck
> > with the PROC SORT PRESORTED method discussed earlier. I'm having a
> > problem
> > with this, however, because the dataset comes indexed and PROC SORT
> wont
> > modify the indexed dataset even though I'm only asking it to validate
> the
> > sort order.
> >
> >
> > On Tue, Jan 4, 2011 at 11:53 AM, Keintz, H. Mark
> > <mkeintz@wharton.upenn.edu>wrote:
> >
> > > I also thought assigning the SORTEDBY attribute would generate the
> needed
> > > metadata without the burden of needlessly running proc sort. So I
> tried
> > it.
> > >
> > > But it did not improve data retrieval. Apparently the fact that
> the
> > > VALIDATED=NO attribute remained was sufficient to prevent the data
> engine
> > > from stopping once J>1.
> > >
> > > I would appreciate examples of performance improvements or other
> > efficacies
> > > (beyond documenting the sort order) from setting SORTEDBY in the
> absence
> > of
> > > validation by PROC SORT.
> > >
> > > Regards,
> > > Mark
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf
> Of
> > > > Mike Zdeb
> > > > Sent: Tuesday, January 04, 2011 2:22 PM
> > > > To: SAS-L@LISTSERV.UGA.EDU
> > > > Subject: Re: Is it sorted? Are there performance benefits?
> > > >
> > > > hi (again, forgot to add) ... and you could also add the
> attribute as
> > > > you make the data set ...
> > > >
> > > > data have;
> > > > input x @@;
> > > > datalines;
> > > > 1 3 2
> > > > ;
> > > > run;
> > > >
> > > > proc sort data = have;
> > > > by x;
> > > > run;
> > > >
> > > > data have_2 (sortedby=x);
> > > > set have;
> > > > by x;
> > > > y = x + 10;
> > > > run;
> > > >
> > > > --
> > > > Mike Zdeb
> > > > U@Albany School of Public Health
> > > > One University Place (Room 119)
> > > > Rensselaer, New York 12144-3456
> > > > P/518-402-6479 F/630-604-1475
> > > >
> > > > > hi ... if you are sure that a data set is sorted, you can add
> that
> > > > attribute ...
> > > > >
> > > > > proc datasets lib=work nolist;
> > > > > modify have_2 (sortedby=x);
> > > > > quit;
> > > > >
> > > > > proc contents data = have_2;
> > > > > run;
> > > > >
> > > > > you'll see ...
> > > > >
> > > > > Sort Information
> > > > > Sortedby x
> > > > > Validated NO
> > > > > Character Set ANSI
> > > > >
> > > > > and if you try to sort HAVE_2 ...
> > > > >
> > > > > 198
> > > > > 199 proc sort data=have_2;
> > > > > 200 by x;
> > > > > 201 run;
> > > > >
> > > > > NOTE: Input data set is already sorted, no sorting done.
> > > > > NOTE: PROCEDURE SORT used (Total process time):
> > > > > real time 0.00 seconds
> > > > > cpu time 0.00 seconds
> > > > >
> > > > >
> > > > > you can change that attribute even if the data set is not
> sorted ...
> > > > >
> > > > >
> > > > > data have;
> > > > > input x : $1. @@;
> > > > > datalines;
> > > > > 1 3 2
> > > > > ;
> > > > > run;
> > > > >
> > > > > proc datasets lib=work nolist;
> > > > > modify have (sortedby=x);
> > > > > quit;
> > > > >
> > > > > proc contents data=have;
> > > > > run;
> > > > >
> > > > > proc sort data=have;
> > > > > by x;
> > > > > run;
> > > > >
> > > > > and still get ...
> > > > >
> > > > > Sort Information
> > > > > Sortedby x
> > > > > Validated NO
> > > > > Character Set ANSI
> > > > >
> > > > > and ...
> > > > >
> > > > > 262 proc sort data=have;
> > > > > 263 by x;
> > > > > 264 run;
> > > > >
> > > > > NOTE: Input data set is already sorted, no sorting done.
> > > > > NOTE: PROCEDURE SORT used (Total process time):
> > > > > real time 0.00 seconds
> > > > > cpu time 0.00 seconds
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Mike Zdeb
> > > > > U@Albany School of Public Health
> > > > > One University Place (Room 119)
> > > > > Rensselaer, New York 12144-3456
> > > > > P/518-402-6479 F/630-604-1475
> > > > >
> > > > >> Dear SAS-L,
> > > > >>
> > > > >> I have a dataset that is sorted and proc contents reports it
> as
> > > > sorted. I
> > > > >> then read that dataset using a by statement and make a new
> column.
> > > > Suddenly
> > > > >> proc contents reports it as not sorted. Of course, it is
> still
> > > > sorted, but
> > > > >> I was wondering if anyone knew if there were performance
> problems
> > > > that would
> > > > >> occur since proc contents things it's not sorted.
> > > > >>
> > > > >> Here's an example to show what's happening:
> > > > >>
> > > > >> data have;
> > > > >> input x $;
> > > > >> datalines;
> > > > >> 1
> > > > >> 3
> > > > >> 2
> > > > >> ;
> > > > >> run;
> > > > >>
> > > > >> proc sort data = have;
> > > > >> by x;
> > > > >> run;
> > > > >>
> > > > >> data have_2;
> > > > >> set have;
> > > > >> by x;
> > > > >> y = x + 10;
> > > > >> run;
> > > > >>
> > > > >> proc contents data = have;
> > > > >> run;
> > > > >> proc contents data = have_2;
> > > > >> run;
> > > > >>
> > > > >> Thanks,
> > > > >> Sterling
> > > > >>
> > > > >
> > >
> >
|