Date: Wed, 24 Sep 2003 20:58:55 GMT
Reply-To: julierog@ix.netcom.com
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Roger Lustig <trovato@VERIZON.NET>
Subject: Re: Qn: How to Merge Data-sets based on Matches in Variable
substring
Content-Type: text/plain; charset=us-ascii; format=flowed
Jack:
The original join code had a join -- the comma between the two data
sets, which means as much as "inner join". It also had a WHERE, which
can be used with inner joins instead of an ON.
Note that SQL was not giving error messages, just warning that it
couldn't solve the problem economically. The problem as Kundu posed it
involved *not* having a key variable that's identical in both datasets.
And the reason that SQL had such trouble with it was that it
couldn't establish a sort order to make things efficient.
When you get right down to it, the join is asking the following:
if *any* substring of key1 having the length of key2 is identical to
key2, then join two records.
If key2 always has the same length, you can "optimize" by creating a new
variable called SUBKEY1:
%let b_len=3;
data sasdata.t1a;
set sasdata.t1;
serial + 1; *Needed to control duplicates;
do I=1 to (1+length(key1)-&b_len);
subkey1=substr(key1,I,&b_len);
end;
run;
Now you can run a simple inner join:
proc sql;
create table sasdata.t3(drop=subkey1 serial) as
select
distinct /*other half of duplicate control */
*
from
sasdata.t1a,
sasdata.t2
where
a.subkey1 = b.key2
;
quit;
(If key1 is unique, then you don't need serial, but you need the
DISTINCT in case there's a key1 that has 2 or more instances of a
particular matching key2 as a substring. This probably isn't quite
right either, but it's a start...)
If the length of key2 is not always the same, you're either stuck using
the INDEX function or making a much larger sasdata.t1a, with one record
for each substring of each possible length...
Best,
Roger
Jack wrote:
> You need to use a join clause like ON. Your error is that SAS doesn't
> know how to match the two datasets together. In other words it knows
> what to accept but not how to paste them together.
>
> If you like the standard datastep in you code, you could try to do the
> same thing with a merge clause. You need to establish the BY vars
> (keyX??). After the two datasets are sorted you could use the subset
> of the two datasets (that match your inclusion clause).
>
> data test;
> merge sasdata.t1(in=in_t1)
> sasdata.t2(in=in_t2);
> by your_by_var(s);
> if index(key1,key2) >0;
> run;
>
> Your SQL statement may work once yu establish the by vars too.
>
> from sasdata.t1 a left join sasdata.t2 b
> on a.your_t1_key_var = b.your_t2_key_var
> where index(a.key1,bkey2);
> I HAVEN'T CHECKED THE SQL CODE THOUGH, I HOPE THIS HELPS.
>
>
>
> omkundu@DELOITTE.COM (Kundu, Om US - Boston) wrote in message news:<66614C61EFE5E34AA5BA93349D9E86761D8C14@usestnt410.us.deloitte.com>...
>
>>I'm trying to merge 2 datasets based upon whether a substring of a
>>text-variable in one dataset occurs in a text-variable of the other dataset.
>>Upon invoking the index function for performing a SQL inner-join (per code
>>below), SAS is unable to complete it and reports that the query involves a
>>Cartesian-product that is not optimized:
>>
>>PROC SQL;
>>CREATE TABLE sasdata.t3 AS
>> SELECT *
>> FROM sasdata.t1 T1, sasdata.t2 T2
>> WHERE INDEX(T1.key1, T2.key2)>0;
>>RUN;
>>
>>Any quick suggestions to (a)work around this OR (b)alternative ways to
>>accomplish this Merge would be greatly appreciated.
>>
>>Thanks in advance!
>>~Om
>>This message (including any attachments) contains confidential information
>>intended for a specific individual and purpose, and is protected by law. If
>>you are not the intended recipient, you should delete this message. Any
>>disclosure, copying, or distribution of this message, or the taking of any
>>action based on it, is strictly prohibited.
|