Date:  Mon, 9 Apr 2007 22:30:34 -0700
Reply-To:  David L Cassell <davidlcassell@MSN.COM>
Sender:  "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:  David L Cassell <davidlcassell@MSN.COM>
Subject:  Re: Split sample
In-Reply-To:  <200704092227.l39JEDBe029304@mailgw.cc.uga.edu>
Content-Type:  text/plain; format=flowed

Howard sagely replied:
>
>On Mon, 9 Apr 2007 18:02:30 -0400, data _null_; <datanull@GMAIL.COM> wrote:
>
> >If I understand correctly, you can do this with PROC RANK as below.
> >
> >
> >proc plan seed=499749471;
> > factors y = 100 of 10000;
> > output out=work.sample;
> > run;
> > quit;
> >proc rank groups=6 data=work.sample out=work.sample;
> > var y;
> > ranks group;
> > run;
> >proc print;
> > run;
>
>Another illustration:
>
> data for_rank / view=for_rank;
> set sashelp.class;
> groupnum = ranuni(1357);
> run;
>
> proc rank groups=5 data=for_rank out=split;
> var groupnum;
> run;
>
Nice.
Here's a one-pass solution, based on a mathematical extension to the
classical k/n algorithm. We basically perform the k/n algorithm on
multiple categories simultaneously. So it's simple to show by induction
that each category gets a random sample of the data set.
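
In symbols (the notation here is introduced only for illustration): let N be
the number of records, x_j the target size of group j, and m_j the number of
records already placed in group j when record n arrives. The rule is then

```latex
\[
P(\text{record } n \to \text{group } j) = \frac{x_j - m_j}{N + 1 - n},
\qquad
\sum_{j=1}^{K} (x_j - m_j) = N - (n - 1) = N + 1 - n .
\]
```

The second identity holds because the target sizes add up to N and exactly
n - 1 records have been placed so far. So the K probabilities always sum to 1,
and a group that has reached its target gets probability 0. With K = 2,
x_1 = k and x_2 = N - k, this reduces to the classical k/n selection.
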
/* Some annoying code by David Cassell to do a K-way split. */
/* Last update: Feb 17, 2007 */
/* I wrote this for K-fold cross-validation problems. */
%let k=10;
%let indata=temp1;
%let seed=405848483;
data xv1(drop=_p1-_p&K _j);
   array m{&K} _temporary_ (&K * 0); /* counts of records allocated per group so far */
   array x{&K} _temporary_ ;         /* max counts for the groups */
   array p{&K} _p1-_p&K ;            /* probabilities for next assignment of group */
   set &INDATA nobs=numrecs;
   /* set the max values for the categories */
   if _n_=1 then do;
      do _j = 1 to &K;
         x{_j} = int(numrecs/&K) + 1*(_j le mod(numrecs,&K));
      end;
   end;
   /* compute p{*} given NUMRECS, _N_, and the values in m{}, x{} */
   do _j = 1 to &K;
      p{_j} = ( x{_j}-m{_j} ) / (numrecs+1-_n_);
   end;
   /* select the value of GROUP and then adjust m{} accordingly */
   group = rantbl(&seed, of p{*} );
   m{group} = m{group}+1;
run;
So this code does a roughly even split of the records into the K groups
(group sizes differ by at most one record), so it's ready for use in
something like the cross-validation code I wrote in my "Don't Be Loopy:
Re-Sampling and Simulation the SAS(R) Way" paper for SGF 2007.
*BUT* this is one of the pieces which I didn't have room for in the
paper. It was already 20 pages when I sent it off. I didn't have room
for this, or several additional bootstraps, or a decent coverage of
simulation problems. The paper does have code for Leave-One-Out
Cross-Validation and Random K-Fold Cross-Validation, as well as plenty
of other wackiness.
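
To check that the K-way split above came out roughly even, something like the
following should do. This is only a sketch: it assumes the DATA step above has
already run and created WORK.XV1 with the variable GROUP. The macro variable
FOLD and the data set names TRAIN and TEST are hypothetical, introduced just
for illustration, and are not from the paper.

```sas
/* Check the group sizes: with N records and K groups, each count */
/* should be either int(N/K) or int(N/K)+1.                       */
proc freq data=xv1;
   tables group / nocum;
run;

/* Hypothetical use for one fold: hold out group &FOLD as the test set. */
%let fold=1;
data train test;
   set xv1;
   if group = &fold then output test;
   else output train;
run;
```

Splitting into separate data sets is just the simplest way to show the idea;
for all K folds at once, keeping GROUP in one data set and processing by
GROUP is likely to be less work.
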
David

David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
> >
> >On 4/9/07, mcolowasth@yahoo.co.uk <mcolowasth@yahoo.co.uk> wrote:
> >> Hi all,
> >>
> >> I want to split my sample into 5 or 6 equal groups. How can I do this in
> >> SAS?
> >>
> >> Thanks a lot
> >>
