Date: Thu, 6 May 2010 16:30:15 -0500
Reply-To: Aldi <aldi@WUSTL.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Aldi <aldi@WUSTL.EDU>
Subject: Re: Fastest way to merge large dataset
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
I agree with you that if a solution looks like a nightmare to a programmer,
it is not a good solution. A solution is good if it is practical and
applicable in a specific setting.
Since we did not have any data with which to measure the performance of a
solution, I was speaking in general terms: a large problem that takes a very
long time in SAS can be split into smaller sets and processes; perform the
analysis on each piece and then put the final resulting data/results together.
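As a rough sketch of that split-and-recombine idea (the dataset names, the
chunk size, and the two-way split below are all illustrative assumptions,
not the poster's actual code):

```sas
/* Tag each observation with its original position so order can be restored. */
data work.tagged;
   set work.big;              /* hypothetical large input dataset */
   rowid = _n_;               /* SAS internal observation number  */
   chunk = ceil(_n_ / 5000);  /* assign 5,000-observation chunks  */
run;

/* Split into chunk datasets (two chunks shown; a macro loop generalizes this). */
data work.part1 work.part2;
   set work.tagged;
   if chunk = 1 then output work.part1;
   else if chunk = 2 then output work.part2;
run;

/* ... run the expensive step on each part, possibly on separate machines ... */

/* Recombine the processed parts and restore the original order. */
data work.final;
   set work.part1 work.part2;
run;

proc sort data=work.final;
   by rowid;
run;
```

The saved ROWID plays the role of the index mentioned below: it keeps the
data "intact" in the sense that the original observation order can always
be reconstructed after the parts are stacked back together.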
I was not talking about merges only. SAS is a good programming language
that becomes more efficient when used with parallel processing, just as can
be done with software written in C or other languages.
For example (Sigurd, here I am talking about C programming): a C program
using MCMC updates of probabilities has been running since November 2009,
and 3 out of 22 jobs that use the full data sets have still not finished.
When these data are split into smaller sets, the jobs finish within hours
using parallel processing.
The same holds for SAS (I am talking about millions of observations, as the
lister mentioned): if one subsets the data into smaller sets, one can
shorten the run time drastically. With a single computer one cannot do what
is described above, but with many computers it is worth the try. Software
that provides such parallel-processing facilities includes SAS Grid, LSF,
and Gridware from Sun Microsystems (the last one was free, but I do not
know its status under Oracle). At my work we use LSF.
If things are set up properly, what I described is an investment that
returns a real gain in speed.
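One hedged sketch of how such parallel subsetting might look inside a single
SAS installation, assuming a SAS/CONNECT license for MP CONNECT (the library
path, task names, and the sort step are illustrative, not a site-specific
recipe):

```sas
/* Assumes SAS/CONNECT is licensed; SASCMD="!sascmd" spawns local child sessions. */
options autosignon sascmd="!sascmd";

libname proj "/path/to/project";   /* hypothetical shared library */

rsubmit task1 wait=no inheritlib=(proj);
   proc sort data=proj.part1; by id; run;   /* expensive step on subset 1 */
endrsubmit;

rsubmit task2 wait=no inheritlib=(proj);
   proc sort data=proj.part2; by id; run;   /* expensive step on subset 2 */
endrsubmit;

waitfor _all_ task1 task2;   /* block until both child sessions complete */
signoff _all_;

/* Afterwards the sorted parts can be stacked or interleaved with SET. */
```

On a grid, a scheduler such as LSF or SAS Grid Manager dispatches these
sessions to separate machines instead of separate local processes; the
program structure stays essentially the same.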
On 5/6/2010 3:56 PM, Sigurd Hermansen wrote:
> I have to disagree to the maximum degree allowed by law with your suggested method of improving SAS performance ;>
> Why substitute a programming nightmare for a SAS performance problem?
> SAS merges and joins perform faster when a database represents the same information in more rows and fewer variables. Many variables almost always result from denormalized datasets that repeat categorical or temporal event data within a row for an entity instead of across rows for an entity group. Subsetting these clumsy denormalized records merely compounds programming problems.
> Once upon a time a debate on database design pitted advocates of denormalized forms (although not called such) against advocates of normalized forms. Sure, the advocates of denormalized forms conceded (after seeing numerous demonstrations), the normalized forms require less space to store many narrow forms that join into a wide denormalized form, but the processing required to join the narrow forms takes too long for practical purposes. When advocates of normalized forms improved performance of database systems enough to counter this objection, wide denormalized forms largely disappeared at the logical level and were replaced by normalized relational databases.
>
> Many SAS programmers have resisted this technological advance, but far fewer today than when I first began a campaign on SAS-L and in SUGI/SGF meetings to incorporate relational database designs into the SAS System following Paul Kent's introduction of SAS Proc SQL in Version 6. SAS SQL makes it easy to create views, restructure denormalized datasets into 3rd normal form or better, and join datasets as needed to display or transfer datasets in forms required by clients (even statisticians).
> I'll desist in this case if Olivier has compelling reasons for stuffing 500 variables in a dataset and joining that dataset to another wide dataset. I haven't seen a good case as yet for wide datasets, other than the client requires them. Even so, I'll look at the pragmatics (how he plans to use datasets) and keep an open mind about the best way there.
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Aldi
> Sent: Thursday, May 06, 2010 3:48 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Fastest way to merge large dataset
> The fastest way to work with very large SAS datasets, in my experience, is
> to split your problem into smaller pieces; otherwise SAS will take a long time.
> You have to identify which processes in SAS are expensive: for example,
> sorting is very expensive on very large datasets, and so is a transpose
> (millions of observations in a set).
> Create an index based on _N_, the internal observation number of SAS, and
> split the data into smaller pieces, for example sets of 5,000 observations
> each. The index you create will keep your data intact.
> I use parallel processing, sending each subset to a separate processor;
> at the end, after the desired processing is done, I use the SET
> statement to put the data together in the final set.
> Everything is done in SAS, and processes that would take weeks or months
> to finish on the full sets are done in a matter of hours. SAS
> is great for parallel processing.
> On 5/6/2010 2:10 PM, SUBSCRIBE SAS-L Olivier Van Parys wrote:
>> Hi Everyone,
>> People in my team are surprised by the amount of time required to merge large
>> datasets. I have 6 datasets with 3 million rows and 500 variables each to
>> merge together. I am using hash tables, since I read in a paper that this was
>> the best way to do it. Do you share this opinion, or is there another way?
>> Olivier Van Parys, PhD
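For reference, the hash-object join Olivier describes can be sketched
roughly as below. The dataset names and the key variable ID are assumptions
for illustration; the real program would repeat the lookup (or declare
several hashes) to bring all six datasets together.

```sas
/* Hash-object join: the smaller dataset is loaded into memory once and
   each row of the large dataset does an O(1) key lookup. */
data work.merged;
   if 0 then set work.small;            /* put SMALL's variables in the PDV */
   if _n_ = 1 then do;
      declare hash h(dataset: "work.small");
      h.defineKey("id");                /* join key (illustrative name)     */
      h.defineData(all: "yes");         /* retrieve all variables of SMALL  */
      h.defineDone();
   end;
   set work.large;
   if h.find() = 0;                     /* keep matches only (inner join)   */
run;
```

With 3-million-row, 500-variable datasets, memory is the real constraint:
the hash table must hold its source dataset entirely in RAM, so loading only
the key plus the variables actually needed into defineData matters more than
the choice of algorithm itself.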