LISTSERV at the University of Georgia
Date:   Mon, 9 Apr 2007 22:30:34 -0700
Reply-To:   David L Cassell <davidlcassell@MSN.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   David L Cassell <davidlcassell@MSN.COM>
Subject:   Re: Split sample
In-Reply-To:   <200704092227.l39JEDBe029304@mailgw.cc.uga.edu>
Content-Type:   text/plain; format=flowed

Howard sagely replied:
>
>On Mon, 9 Apr 2007 18:02:30 -0400, data _null_; <datanull@GMAIL.COM> wrote:
>
> >If I understand correctly, you can do this with PROC RANK as below.
> >
> >proc plan seed=499749471;
> >   factors y = 100 of 10000;
> >   output out=work.sample;
> >   run;
> >   quit;
> >proc rank groups=6 data=work.sample out=work.sample;
> >   var y;
> >   ranks group;
> >   run;
> >proc print;
> >   run;
>
>Another illustration:
>
>  data for_rank / view=for_rank;
>  set sashelp.class;
>  groupnum = ranuni(1357);
>  run;
>
>  proc rank groups=5 data=for_rank out=split;
>  var groupnum;
>  run;
>

Nice.

Here's a one-pass solution, based on a mathematical extension to the classical k/n algorithm. We basically perform the k/n algorithm on multiple categories simultaneously. So it's simple to show by induction that each category gets a random sample of the data set.
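
One way to sketch that induction, in notation of my own choosing rather than anything from the paper: let N be the number of records (NUMRECS in the code), x_j the target size of group j, and m_j(n) the number of the first n-1 records already assigned to group j. This only checks the marginal probability for each record, so treat it as an outline and not a full proof.

```latex
% Assignment probability for record n, given the counts so far.
% The numerators sum to the number of records still unassigned,
% so the K probabilities sum to 1.
P\bigl(g_n = j \mid m_1(n), \dots, m_K(n)\bigr) = \frac{x_j - m_j(n)}{N + 1 - n},
\qquad
\sum_{j=1}^{K} \bigl(x_j - m_j(n)\bigr) = N - (n - 1).

% Base case: m_j(1) = 0, so P(g_1 = j) = x_j / N.
% Induction step: if P(g_i = j) = x_j / N for all i < n,
% then E[m_j(n)] = (n - 1) x_j / N, and
P(g_n = j) = \frac{x_j - E[m_j(n)]}{N + 1 - n}
           = \frac{x_j \, (N + 1 - n) / N}{N + 1 - n}
           = \frac{x_j}{N}.
```

So every record has the same chance x_j/N of landing in group j, no matter where it sits in the data set.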

/* Some annoying code by David Cassell to do a K-way split. */
/* Last update: Feb 17, 2007 */
/* I wrote this for K-fold cross-validation problems. */

%let k=10;
%let indata=temp1;
%let seed=405848483;

data xv1(drop=_p1-_p&K _j);
   array m{&K} _temporary_ (&K * 0);  /* counts of total records allocated per group so far */
   array x{&K} _temporary_ ;          /* max counts for the groups */
   array p{&K} _p1-_p&K ;             /* probabilities for next assignment of group */
   set &INDATA nobs=numrecs;

   /* set the max values for the categories */
   if _n_=1 then do;
      do _j = 1 to &K;
         x{_j} = int(numrecs/&K) + 1*(_j le mod(numrecs,&K));
      end;
   end;

   /* compute p{*} given NUMRECS, _N_, and the values in m{}, x{} */
   do _j = 1 to &K;
      p{_j} = ( x{_j}-m{_j} ) / (numrecs+1-_n_);
   end;

   /* select the value of GROUP and then adjust m{} accordingly */
   group = rantbl(&seed, of p{*} );
   m{group} = m{group}+1;
run;

So this code does a roughly even split of the records into the K groups, so it's ready for use in something like the cross-validation code I wrote in my "Don't Be Loopy: Re-Sampling and Simulation the SAS(R) Way" paper for SGF 2007.
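
If you want to convince yourself the split is even, a quick check along these lines should do it. This is a minimal sketch and not part of the code above: it assumes the output data set is still named XV1, and the TRAIN and TEST names and the choice of fold 1 as the hold-out are made up for illustration.

```sas
/* Sanity check (illustrative only): count the records in each group. */
/* Group sizes should differ by at most 1.                            */
proc freq data=xv1;
   tables group / nocum nopercent;
run;

/* Hypothetical use of one fold: hold out group 1, train on the rest. */
data train test;
   set xv1;
   if group = 1 then output test;
   else output train;
run;
```

For full K-fold cross-validation you would repeat the second step for each value of GROUP from 1 to &K, or better, handle the folds with BY-group processing instead of a macro loop.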

*BUT* this is one of the pieces which I didn't have room for in the paper. It was already 20 pages when I sent it off. I didn't have room for this, or several additional bootstraps, or a decent coverage of simulation problems. The paper does have code for Leave-One-Out Cross-Validation and Random K-Fold Cross-Validation, as well as plenty of other wackiness.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

> >On 4/9/07, mcolowasth@yahoo.co.uk <mcolowasth@yahoo.co.uk> wrote:
> >> Hi all,
> >>
> >> I want to split my sample into 5 or 6 equal groups. How can I do this in
> >> SAS?
> >>
> >> Thanks a lot
> >>
