Date: Mon, 9 Apr 2007 22:30:34 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Split sample
In-Reply-To: <200704092227.l39JEDBe029304@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed
Howard sagely replied:
>
>On Mon, 9 Apr 2007 18:02:30 -0400, data _null_; <datanull@GMAIL.COM> wrote:
>
> >If I understand correctly, you can do this with PROC RANK as below.
> >
> >
> >proc plan seed=499749471;
> > factors y = 100 of 10000;
> > output out=work.sample;
> > run;
> > quit;
> >proc rank groups=6 data=work.sample out=work.sample;
> > var y;
> > ranks group;
> > run;
> >proc print;
> > run;
>
>Another illustration:
>
> data for_rank / view=for_rank;
> set sashelp.class;
> groupnum = ranuni(1357);
> run;
>
> proc rank groups=5 data=for_rank out=split;
> var groupnum;
> run;
>
Nice.
Here's a one-pass solution, based on a mathematical extension of the
classical k/n algorithm (select each record with probability equal to the
number still needed divided by the number of records remaining). We
basically perform the k/n algorithm on multiple categories simultaneously,
so it's simple to show by induction that each category gets a random
sample of the data set.
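For reference, the single-sample version of k/n can be sketched roughly
as below. This is only an illustration: the macro variable WANT, the data
set SRS1, the sample size of 25, and the seed are made up, and TEMP1 is
assumed to be the same input data set used in the K-way code.

```sas
/* Sketch only: classical k/n selection of one simple random sample.  */
/* WANT, SRS1 and the sample size of 25 are made up for illustration. */
%let want=25;
data srs1(drop=_need);
   retain _need &want;               /* records still to be selected  */
   set temp1 nobs=numrecs;
   /* select with probability (still needed) / (records remaining)    */
   if ranuni(405848483) < _need / (numrecs + 1 - _n_) then do;
      output;
      _need = _need - 1;
   end;
run;
```

The K-way version replaces the yes/no draw with a single RANTBL draw over
K categories, each with its own "still needed" count.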
/* Some annoying code by David Cassell to do a K-way split. */
/* Last update: Feb 17, 2007 */
/* I wrote this for K-fold cross-validation problems. */
%let k=10;
%let indata=temp1;
%let seed=405848483;
data xv1(drop=_p1-_p&K _j);
   /* counts of total records allocated per group so far */
   array m{&K} _temporary_ (&K * 0);
   /* max counts for the groups */
   array x{&K} _temporary_;
   /* probabilities for next assignment of group */
   array p{&K} _p1-_p&K;

   set &INDATA nobs=numrecs;

   /* set the max values for the categories */
   if _n_=1 then do;
      do _j = 1 to &K;
         x{_j} = int(numrecs/&K) + 1*(_j le mod(numrecs,&K));
      end;
   end;

   /* compute p{*} given NUMRECS, _N_, and the values in m{}, x{} */
   do _j = 1 to &K;
      p{_j} = ( x{_j}-m{_j} ) / (numrecs+1-_n_);
   end;

   /* select the value of GROUP and then adjust m{} accordingly */
   group = rantbl(&seed, of p{*} );
   m{group} = m{group}+1;
run;
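A quick sanity check on the output, as a sketch: the frequency table
should show every group with either int(numrecs/K) records or one more
than that. It assumes the data step above has been run and created XV1.

```sas
/* Sketch: tabulate the group sizes in XV1 to confirm the split. */
proc freq data=xv1;
   tables group / nocum;
run;
```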
This code splits the records into K groups whose sizes differ by at most
one, so the result is ready for use in something like the cross-validation
code I wrote in my "Don't Be Loopy: Re-Sampling and Simulation the SAS(R)
Way" paper for SGF 2007.
*BUT* this is one of the pieces which I didn't have room for in the
paper. It was already 20 pages when I sent it off. I didn't have room
for this, or several additional bootstraps, or a decent coverage of simulation
problems. The paper does have code for Leave-One-Out Cross-Validation
and Random K-Fold Cross-Validation, as well as plenty of other wackiness.
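For anyone wiring GROUP into a K-fold cross-validation, one possible
arrangement is sketched below. This is not the code from the paper, just
an illustration in the same BY-group spirit; the names XV_FOLDS, FOLD and
HOLDOUT are made up, and it assumes XV1 exists and that the macro
variable K is still set.

```sas
/* Sketch only: stack K copies of XV1, one per fold, and flag the  */
/* records held out in each fold. XV_FOLDS, FOLD and HOLDOUT are   */
/* made-up names for illustration.                                 */
data xv_folds;
   set xv1;
   do fold = 1 to &k;
      holdout = (group = fold);   /* 1 = held out in this fold */
      output;
   end;
run;

proc sort data=xv_folds;
   by fold;
run;
```

A modeling procedure can then be run once with BY FOLD, fitting on the
HOLDOUT=0 records and scoring the HOLDOUT=1 records, instead of looping
over the folds in a macro.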
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
> >
> >On 4/9/07, mcolowasth@yahoo.co.uk <mcolowasth@yahoo.co.uk> wrote:
> >> Hi all,
> >>
> >> I want to split my sample into 5 or 6 equal groups. How can I do this
> >> in SAS?
> >>
> >> Thanks a lot
> >>