Date:   Wed, 20 Nov 2002 16:05:14 GMT
Reply-To:   Tim Berryhill <tim@AARTWOLF.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Tim Berryhill <tim@AARTWOLF.COM>
Organization:   Concentric Internet Services
Subject:   Re: Manipulation of Very large data files - Merges and sampling Help needed badly

Dear adfasdf, I accidentally replied to your ID rather than the list, so this bounced...

Larry, which resource are you most concerned with minimising? Do you want code which runs quickly, code which does not request more RAM, code which does not request more DASD (actually, these often move together), code which is easy to write, code which is easy to modify, or code which is easy to explain? I almost always have something else to do, so quickly-written code which runs for days IS efficient.

Are you going to combine the data once to form your analytic files, or will you form many different combinations? It sounds like you only need to merge once, then draw the repeated samples. If so, I would concentrate on optimising the sampling process.

For the merge itself, Mr. Dorfman could write something for #1 which reads the keys from all 7 files into memory, sorts and merges the keys, then pulls the matching obs from the input files and writes the merged result, but you or I might spend more time trying to really understand that code than would be spent sorting each file once and merging them. On the other hand, if you need to form many different combinations, it will pay off to study his previous posts (http://listserv.uga.edu/archives/sas-l.html) and learn a new way to use SAS.
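To make the plain route concrete, here is a minimal sketch of sort-then-merge; the library BIG, the dataset names IN1-IN7, and the linking variable ID are placeholders I invented, not names from Larry's post:

/* Sort each input file once by the linking variable. */
proc sort data=big.in1 out=work.s1; by id; run;
proc sort data=big.in2 out=work.s2; by id; run;
/* ... and likewise for in3 through in7 ... */

/* One sequential pass then builds the analytic file. */
data big.merged;
  merge work.s1 work.s2 /* ... through work.s7 */;
  by id;
run;

With 30 gigs of data the sorts dominate, so point the WORK library (or the OUT= datasets) at the drives with the most free space.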

Regarding the stratified random samples, there have been some lovely examples posted here--you may want to search the archive (http://listserv.uga.edu/archives/sas-l.html). The code which runs most quickly would probably read the obs number and stratification values into memory, then in each stratification cell draw without replacement (generate random numbers), and finally read the actual obs with SET ... POINT= and write them out; a sketch follows below. You might set up the initial merge to generate the base dataset sorted by the stratification variables and also generate the population statistics you need for the sampling. Then the sampling code could just generate enough offsets into each cell, calculate the POINT= values, read and write.
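A minimal sketch of that plan, assuming the merged file is BIG.MERGED, the strata are X and Y, and a flat 3% rate (all placeholder names again):

/* Pass 1: keep only each ob's number and its stratification cell. */
data work.keys;
  set big.merged;
  obsno = _n_;
  keep obsno x y;
run;

proc sort data=work.keys; by x y; run;

/* Pass 2: draw within each cell, then pull just the selected obs
   from the big file by direct access. */
data work.sample1;
  set work.keys;
  by x y;
  if ranuni(20021120) < 0.03 then do;
    p = obsno;
    set big.merged point=p;   /* random access to ob number P */
    output;
  end;
  drop obsno;
run;

This is the "3% chance per ob" flavour; exact per-cell counts take the sequential trick sketched further down. Note the driving SET on WORK.KEYS supplies the stop condition which a POINT= read cannot.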

With a little care and infinite DASD, you could write the bootstrapping macro to read the keys once and call RANUNI as needed to assign obs to samples, then read each ob used in any sample just once and write it to every sample which drew it (avoiding rereading an ob used in several samples).
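A hedged sketch of that one-pass idea, built on WORK.KEYS from above; the macro name and parameters are invented for illustration:

%macro bootdraw(base=big.merged, reps=100, rate=0.03, seed=0);
  /* One pass over the keys: for every replicate, decide which
     obs it uses; one record per (sample, obsno) selection. */
  data work.picks;
    set work.keys;
    do sample = 1 to &reps;
      if ranuni(&seed) < &rate then output;
    end;
    keep sample obsno;
  run;

  /* Group the selections by ob so each ob is fetched only once. */
  proc sort data=work.picks; by obsno; run;

  data work.allsamples;
    set work.picks;
    by obsno;
    if first.obsno then do;   /* read the ob just once...        */
      p = obsno;
      set &base point=p;
    end;                      /* ...SET variables are retained,
                                 so later picks of the same ob
                                 reuse the values already read.  */
  run;

  proc sort data=work.allsamples; by sample; run;
%mend bootdraw;

With SEED=0, RANUNI seeds from the clock and each run differs; pass a nonzero seed to make the draws reproducible.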

Is your stratification proportional to the population, or is it weighted? The code is less complex if you want 3% in each cell than if the percentage increases as some measure of cell variance increases. Also, some people expect a 3% chance of drawing each ob (allowing some cells to be over- or under-sampled), while others prefer exactly the per-cell ob counts they requested in each sample.
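For the "exactly the requested count" flavour, the classic sequential draw keeps the inclusion probability at (still needed)/(still remaining); here is a sketch for a single cell, where the macro variables K (requested count) and N (cell size) are assumed known in advance, say from PROC FREQ:

/* Exact-count draw of &K obs from one cell of &N obs. */
data work.cellsample;
  retain k &k n &n;
  set work.onecell;           /* keys for a single X*Y cell */
  if ranuni(0) < k / n then do;
    output;                   /* selected: one fewer still needed */
    k = k - 1;
  end;
  n = n - 1;                  /* one fewer ob still remaining */
  if k = 0 then stop;         /* quota filled, quit early */
  drop k n;
run;

Running all cells in one step means resetting K and N at each FIRST. boundary, which is exactly where those precomputed per-cell counts come in.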

If you post more details, the list can provide more suggestions.

Tim Berryhill

"adfasdf" <asdfasdf@zdfgasdf.com> wrote in message news:DxYB9.139566$c51.41148786@twister.nyroc.rr.com... > Folks, > I am working with some rather large datasets which require me to first merge > a bunch of these files together to get my analytic files. These files have > in excess of 50 million obs and are approximately 30 gigs in size when > merged. > > 1) What is the most efficient way to merge together these files together > for a given linking variable? > 1.a) sort by the by variable and merge in a data step? > 1.b) use some type of SQL statement to achieve this? > 1.c) index the files first and then merge them? > I want to know the most efficient way of doing this. > My machine is a Dell 6300 Server with a Perc2- quad channel 128 meg card, 1 > gig of ram and lots of 10K drives. > > 2. I will need to skip through this 50 million obs dataset to pull a random > sample that is stratified along two variables X & Y. I want to be able to > do this many times for the purpose of bootstrapping. > > > I could easily brute force both problems, but given the size of the files > and the corresponding cost of time, I would greatly appreciate any wisdom > gained from the ol' SAS-L crowd. > > Thanks, > Larry Van Horn > Rochester, NY > vanhorn@simon.rochester.edu > >

