Date: Fri, 2 Mar 2001 17:27:05 -0000
Reply-To: sashole@bellsouth.net
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject: Re: Match Merge
Content-Type: text/plain; format=flowed
>From: James McMillen <jimcmi2@USWEST.COM>
>I am confident I just do not understand how the SAS merge works.
You have not presented any evidence to prove it :-).
>Please help.
Gladly, but you did not tell what you thought was wrong.
>I have two data sets, one with customer phone numbers and zip_codes, >one
>of just zip codes. The first has 10.5 million observations, >the second
>has l227. I want customers that have zips codes that >match the second
>file. There are multiple customers in each zip, >but that zip only shows
>on the zip file once.
For this, why even bother with sorts and merge? Try the thingie below and
(just for the sake of fun), compare the run-time with that of the merge:
data indterr.btns_covered_mt
indterr.btns_notcovered_mt
;
array xzip (0:99999) _temporary_;
if _n_=1 then do until (zeof);
set indterr.Zips_state_sort2 (keep=zip_code) end=zeof;
xzip (zip_code) = 1;
end;
set indterr.custbtns_new_sort;
if xzip (zip_code) then output indterr.btns_covered_mt;
else output indterr.btns_notcovered_mt;
run;
Note that with this, you do not have to spend time sorting either file - it
(and any other direct-addressing scheme) does not rely on any kind of input
order. But of course if the files are already in order, it does not hurt
(nor does it help). I assume that you have 5-digit, numeric zip codes. If
they are 5-digit character, use the expression
xzip (input(zip_code,5.))
instead of you understand what. If you have 9-digits zip codes, do not try
it at home by expanding the array to [0:999999999] range! Instead, write to
SAS-L. The 'wide-range' situation can be dealt with in a way, much more
clever than buying 20 Gb of memory :-).
If you need to drag variables other than the key (zip_code) from the small
data set into the output, create a corresponding number of temporary arrays
having proper types/declared lenghts, parallel to xzip.
Kind regards,
====================
Paul M. Dorfman
Jacksonville, Fl
====================
>Here is my code, and the log results: (I only get one match per
>zip_code, but there are multiple for each zip)
>
>838 data indterr.btns_covered_mt indterr.btns_notcovered_mt;
>839 btns=0;
>840 zips5=0;
>841 merge indterr.Zips_state_sort2 (in=zips5)
>842 indterr.custbtns_new_sort (in=btns) ;
>843 by zip_code ;
>844 If btns = 1 and zips5 = 1 then output indterr.btns_covered_mt;
>845 else if btns = 1 and zips5 = 0 then output
>indterr.btns_notcovered_mt ;
>
>846 run ;
>
>NOTE: There were 1227 observations read from the data set
>INDTERR.ZIPS_STATE_SORT2.
>NOTE: There were 10549178 observations read from the data set
>INDTERR.CUSTBTNS_NEW_SORT.
>NOTE: The data set INDTERR.BTNS_COVERED_MT has 1121 observations and 4
>variables.
>NOTE: The data set INDTERR.BTNS_NOTCOVERED_MT has 10548057 observations
>and 4 variables.
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com