LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2003, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 19 Mar 2003 11:33:08 -0700
Reply-To:     Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Subject:      Re: Help with making SQL/Data Step More Efficient - WasWorkaround
              fo r No Resources
Comments: To: yzg9@CDC.GOV
Content-Type: text/plain; charset=us-ascii

If you don't need to use matcheSN_BS again after matches2 has been created, one possible improvement would be to change

create table matcheSN_BS

to

create view matcheSN_BS

That would eliminate at least one read and one write of the data.

There will still be an internal sort, so you still won't have only one pass through the data. Sorting in SQL might be more efficient than sorting with PROC SORT, but that's not a given. The savings probably come from fewer passes through the data, not from better sorting in PROC SQL (the latter knows more about the structure of the data, but I don't know if the code has been optimized to take advantage of that).

You could probably eliminate the data step using the CASE construct, which is the SQL equivalent of IF. I haven't studied the code carefully enough to decide if that's doable in this case, and it might not be more efficient anyway.

A data step is probably the most efficient way to split one data set into four data sets. SQL can write only one data set at a time, so you'd have to make four passes through the data, whereas the data step can do it with one.

You didn't say whether there's additional processing in the middle, but if there's not you could create your 4 data sets in the data step you show below.

-- JackHamilton@FirstHealth.com Manager, Technical Development Metrics Department, First Health West Sacramento, California USA

>>> "Gerstle, John" <yzg9@CDC.GOV> 03/19/2003 8:29 AM >>> Folks, A couple of weeks ago I ran into a problem where my computer could not run a Cartesian match program using SQL. After finding out from SAS Tech Support that my hard drive was too small for the creation of the large dataset, I was blessed with a brand spanking new computer, 30GB hard drive, 400MB RAM, Windows 2000 with NTFS. My program then was able to run - a Cartesian match of a 7100 record dataset which created a ~25,000,000 record dataset via SQL. It took 51 minutes real time and 14 (!) minutes cpu time.

Well, I had to cut down the original dataset from 14,000 for this immediate run, but will need to re-run with the entire dataset instead of the 'halved' dataset ran above. And I expect to have larger datasets that I will need to run over the next several weeks/months. So, I set out to make my SQL and the data step immediately following it more efficient so that I would not use excess memory/time. I removed all non-essential variables and moved the PROC SORT into the PROC SQL. (Also changed my default Paging File Option size to range between 2.5 and 4 GB)

The most recent run with the 7100 record dataset (Code is at the bottom of the email): ************** NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.MATCHESN_BS created, with 25837266 rows and 6 columns.

NOTE: PROCEDURE SQL used: real time 37:18.79 cpu time 7:48.07

NOTE: There were 25837266 observations read from the data set WORK.MATCHESN_BS. NOTE: The data set WORK.MATCHES2 has 25837266 observations and 8 variables. NOTE: DATA statement used: real time 6:38.96 cpu time 57.28 seconds ***********

The PROC SQL creates a table of all possible matches of each record in the dataset. The following DATA STEP defines whether or not STATENO1 & 2 match and BIRTHSEX1 & 2 match. I need the numbers for determining the number of concordant and discordant pairs. After this, I then divvy up this large dataset into four smaller ones based on whether we have matches or not on both STATENO and BIRTHSEX (think of the four quadrants of the frequency table between matchSN and matchBS). It seems to me that I should be able to combine all of this into the one SQL step, thereby only creating/reading the dataset once. Unfortunately, I'm still grasping the SQL wisdom and haven't the best idea on how to do that.

Suggestions?

And is there other wisdom I need here or should I expect to be running this program over night (which is an option)???

Thanks!!!

*************************** CODE:

proc sql; create table matcheSN_BS as select t1.lineID as lineID1, t2.lineID as lineID2, t1.stateno as stateno1, t2.stateno as stateno2, t1.birthsex as birthsex1, t2.birthsex as birthsex2 from temp2 as t1, temp2 as t2 where t1.lineID < t2.LineID order by stateno1,stateno2,lineID1 ; quit;

data matches2 ; set matcheSN_BS; by stateno1 stateno2 lineID1 ; matchSN=2; matchBS=2; if stateno1 = stateno2 then matchSN = 1; if birthsex1 = birthsex2 then matchBS = 1; format matchSN matchBS linked. ; label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs'; run;

John Gerstle CDC Information Technological Support Contract (CITS) Biostatistician


Back to: Top of message | Previous page | Main SAS-L page