Date: Wed, 19 Mar 2003 11:33:08 -0700
Reply-To: Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Subject: Re: Help with making SQL/Data Step More Efficient - WasWorkaround
fo r No Resources
Content-Type: text/plain; charset=us-ascii
If you don't need to use matcheSN_BS again after matches2 has been
created, one possible improvement would be to change
create table matcheSN_BS
to
create view matcheSN_BS
That would eliminate at least one read and one write of the data.
There will still be an internal sort, so you still won't have only one
pass through the data. Sorting in SQL might be more efficient than
sorting with PROC SORT, but that's not a given. The savings probably
come from fewer passes through the data, not from better sorting in PROC
SQL (the latter knows more about the structure of the data, but I don't
know if the code has been optimized to take advantage of that).
You could probably eliminate the data step using the CASE construct,
which is the SQL equivalent of IF. I haven't studied the code carefully
enough to decide if that's doable in this case, and it might not be more
efficient anyway.
A data step is probably the most efficient way to split one data set
into four data sets. SQL can write only one data set at a time, so
you'd have to make four passes through the data, whereas the data step
can do it with one.
You didn't say whether there's additional processing in the middle, but
if there's not you could create your 4 data sets in the data step you
show below.
--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA
>>> "Gerstle, John" <yzg9@CDC.GOV> 03/19/2003 8:29 AM >>>
Folks,
A couple of weeks ago I ran into a problem where my computer could not
run a
Cartesian match program using SQL. After finding out from SAS Tech
Support
that my hard drive was too small for the creation of the large dataset,
I
was blessed with a brand spanking new computer, 30GB hard drive, 400MB
RAM,
Windows 2000 with NTFS. My program then was able to run - a Cartesian
match
of a 7100 record dataset which created a ~25,000,000 record dataset via
SQL.
It took 51 minutes real time and 14 (!) minutes cpu time.
Well, I had to cut down the original dataset from 14,000 for this
immediate
run, but will need to re-run with the entire dataset instead of the
'halved'
dataset ran above. And I expect to have larger datasets that I will
need to
run over the next several weeks/months. So, I set out to make my SQL
and
the data step immediately following it more efficient so that I would
not
use excess memory/time. I removed all non-essential variables and
moved the
PROC SORT into the PROC SQL. (Also changed my default Paging File
Option
size to range between 2.5 and 4 GB)
The most recent run with the 7100 record dataset (Code is at the bottom
of
the email):
**************
NOTE: The execution of this query involves performing one or more
Cartesian
product joins that can not be optimized.
NOTE: Table WORK.MATCHESN_BS created, with 25837266 rows and 6
columns.
NOTE: PROCEDURE SQL used:
real time 37:18.79
cpu time 7:48.07
NOTE: There were 25837266 observations read from the data set
WORK.MATCHESN_BS.
NOTE: The data set WORK.MATCHES2 has 25837266 observations and 8
variables.
NOTE: DATA statement used:
real time 6:38.96
cpu time 57.28 seconds
***********
The PROC SQL creates a table of all possible matches of each record in
the
dataset. The following DATA STEP defines whether or not STATENO1 & 2
match
and BIRTHSEX1 & 2 match. I need the numbers for determining the number
of
concordant and discordant pairs. After this, I then divvy up this
large
dataset into four smaller ones based on whether we have matches or not
on
both STATENO and BIRTHSEX (think of the four quadrants of the
frequency
table between matchSN and matchBS). It seems to me that I should be
able to
combine all of this into the one SQL step, thereby only
creating/reading the
dataset once. Unfortunately, I'm still grasping the SQL wisdom and
haven't
the best idea on how to do that.
Suggestions?
And is there other wisdom I need here or should I expect to be running
this
program over night (which is an option)???
Thanks!!!
***************************
CODE:
proc sql; create table matcheSN_BS as
select t1.lineID as lineID1, t2.lineID as lineID2,
t1.stateno as stateno1, t2.stateno as
stateno2,
t1.birthsex as birthsex1, t2.birthsex as
birthsex2
from temp2 as t1, temp2 as t2 where t1.lineID < t2.LineID order
by
stateno1,stateno2,lineID1 ;
quit;
data matches2 ; set matcheSN_BS; by stateno1 stateno2 lineID1 ;
matchSN=2;
matchBS=2;
if stateno1 = stateno2 then matchSN = 1;
if birthsex1 = birthsex2 then matchBS = 1;
format matchSN matchBS linked. ;
label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs';
run;
John Gerstle
CDC Information Technological Support Contract (CITS)
Biostatistician