```
Date:         Mon, 9 Apr 2007 22:30:34 -0700
Reply-To:     David L Cassell
Sender:       "SAS(r) Discussion"
From:         David L Cassell
Subject:      Re: Split sample
In-Reply-To:  <200704092227.l39JEDBe029304@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed

Howard sagely replied:
>
>On Mon, 9 Apr 2007 18:02:30 -0400, data _null_; wrote:
>
> >If I understand correctly, you can do this with PROC RANK as below.
> >
> >
> >proc plan seed=499749471;
> >   factors y = 100 of 10000;
> >   output out=work.sample;
> >   run;
> >   quit;
> >proc rank groups=6 data=work.sample out=work.sample;
> >   var y;
> >   ranks group;
> >   run;
> >proc print;
> >   run;
>
>Another illustration:
>
>   data for_rank / view=for_rank;
>   set sashelp.class;
>   groupnum = ranuni(1357);
>   run;
>
>   proc rank groups=5 data=for_rank out=split;
>   var groupnum;
>   run;
>

Nice.

Here's a one-pass solution, based on a mathematical extension to
the classical k/n algorithm.  We basically perform the k/n algorithm
on multiple categories simultaneously.  So it's simple to show by
induction that each category gets a random sample of the data set.

/* Some annoying code by David Cassell to do a K-way split.  */
/* Last update:  Feb 17, 2007                                */
/* I wrote this for K-fold cross-validation problems.        */

%let k=10;
%let indata=temp1;
%let seed=405848483;

data xv1(drop=_p1-_p&K _j);
   array m{&K} _temporary_ (&K * 0);  /* counts of total records allocated per group so far */
   array x{&K} _temporary_ ;          /* max counts for the groups */
   array p{&K} _p1-_p&K ;             /* probabilities for next assignment of group */

   set &INDATA nobs=numrecs;

   /* set the max values for the categories */
   if _n_=1 then do;
      do _j = 1 to &K;
         x{_j} = int(numrecs/&K) + 1*(_j le mod(numrecs,&K));
      end;
   end;

   /* compute p{*} given NUMRECS, _N_, and the values in m{}, x{} */
   do _j = 1 to &K;
      p{_j} = ( x{_j}-m{_j} ) / (numrecs+1-_n_);
   end;

   /* select the value of GROUP and then adjust m{} accordingly */
   group = rantbl(&seed, of p{*} );
   m{group} = m{group}+1;
run;

So this code does a roughly even split of the records into the K
groups, so it's ready for use in something like the cross-validation
code I wrote in my "Don't Be Loopy: Re-Sampling and Simulation the
SAS(R) Way" paper for SGF 2007.

*BUT* this is one of the pieces which I didn't have room for in the
paper.  It was already 20 pages when I sent it off.  I didn't have
room for this, or several additional bootstraps, or a decent coverage
of simulation problems.  The paper does have code for Leave-One-Out
Cross-Validation and Random K-Fold Cross-Validation, as well as plenty
of other wackiness.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

> >
> >On 4/9/07, mcolowasth@yahoo.co.uk wrote:
> >> Hi all,
> >>
> >> I want to split my sample into 5 or 6 equal groups. How can I do this in
> >> SAS?
> >>
> >> Thanks a lot
> >>
```
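
A quick way to check the split, offered only as a sketch: the DATA step in the post sets the quotas in x{} so that group sizes can differ by at most one, and a frequency table of GROUP should show that. The step below is not part of the original post. It assumes the DATA step above has already run and produced XV1 with the variable GROUP.

```sas
/* Sanity check, not from the original post: tabulate the group sizes. */
/* Assumes XV1 already exists, created by the K-way split DATA step.   */
proc freq data=xv1;
   tables group / nocum nopercent;  /* counts only, one row per group */
run;
```

Going by the x{} formula in the post, the first mod(numrecs,&K) groups should show int(numrecs/&K)+1 records each and the remaining groups int(numrecs/&K).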
