| Date: | Wed, 20 Nov 2002 16:05:14 GMT |
| Reply-To: | Tim Berryhill <tim@AARTWOLF.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Tim Berryhill <tim@AARTWOLF.COM> |
| Organization: | Concentric Internet Services |
| Subject: | Re: Manipulation of Very large data files - Merges and sampling
Help needed badly |
|---|
Dear Adfasdf, I accidentally replied to your ID rather than the list, so
this bounced...
Larry, Which input are you most concerned with minimising? Do you want code
which runs quickly, code which does not request more RAM, code which does
not request more DASD (actually, these often move together), code which is
easy to write, code which is easy to modify, or code which is easy to
explain? I almost always have something else to do, so quickly-written code
which runs for days IS efficient.
Are you going to combine the data once to form your analytic files, or will
you form many different combinations? It sounds like you only need to merge
once, then draw the repeated samples. If so, I would concentrate on
optimising the sampling process.
That is, Mr. Dorfman can write something for #1 which will read the keys
from 7 files into memory, sort and merge the keys, then pull the matching
obs from the input files and write the merged result, but you or I might
spend more time trying to really understand that code than would be spent
sorting each file once and merging them. On the other hand, if you need to
form many different combinations, then it will pay off to study his previous
posts (http://listserv.uga.edu/archives/sas-l.html) and learn a new way to
use SAS.
Regarding the stratified random samples, there have been some lovely
examples posted here--you may want to search the archive
(http://listserv.uga.edu/archives/sas-l.html). The code which runs most
quickly would probably read the obs number and stratification values into
memory, then in each stratification cell draw without replacement (generate
random numbers), and finally read the actual obs with SET ... POINT=.... and
write them out. You might set up the initial merge to generate the base
dataset sorted by the stratification variables and also generate the
population statistics you need for the sampling. Then the sampling code
could just generate enough offsets into each cell, calculate the POINT=
values, read and write.
With a little care and infinite DASD, you could write the bootstrapping
macro to read the keys once and call RANUNI as needed, read any ob used in
any sample, and write all samples (avoiding rereading an ob used in several
samples).
Is your stratification proportional to the population, or is it weighted?
The code is less complex if you want 3% in each cell than if the percentage
increases as some measure of cell variance increases. Also, some people
expect a 3% chance of drawing each ob (allowing some cells to be over- or
under-sampled), while others prefer exactly the per-cell ob counts they
requested in each sample.
If you post more details, the list can provide more suggestions.
Tim Berryhill
"adfasdf" <asdfasdf@zdfgasdf.com> wrote in message
news:DxYB9.139566$c51.41148786@twister.nyroc.rr.com...
> Folks,
> I am working with some rather large datasets which require me to first
merge
> a bunch of these files together to get my analytic files. These files
have
> in excess of 50 million obs and are approximately 30 gigs in size when
> merged.
>
> 1) What is the most efficient way to merge together these files together
> for a given linking variable?
> 1.a) sort by the by variable and merge in a data step?
> 1.b) use some type of SQL statement to achieve this?
> 1.c) index the files first and then merge them?
> I want to know the most efficient way of doing this.
> My machine is a Dell 6300 Server with a Perc2- quad channel 128 meg card,
1
> gig of ram and lots of 10K drives.
>
> 2. I will need to skip through this 50 million obs dataset to pull a
random
> sample that is stratified along two variables X & Y. I want to be able to
> do this many times for the purpose of bootstrapping.
>
>
> I could easily brute force both problems, but given the size of the files
> and the corresponding cost of time, I would greatly appreciate any wisdom
> gained from the ol' SAS-L crowd.
>
> Thanks,
> Larry Van Horn
> Rochester, NY
> vanhorn@simon.rochester.edu
>
>
|