Date: Thu, 20 Aug 2009 14:25:12 -0400
Reply-To: Ian Whitlock <iw1sas@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <iw1sas@GMAIL.COM>
Subject: Re: Merge with duplicates
Content-Type: text/plain; charset=ISO-8859-1
Summary: analysis of a matching problem with code to illustrate
#iw-value=1
Ed,
Your question looks suspicious to me and I am not sure that you
got the help you need.
Given the two data sets
data set1;
input name $ 1-16 num 17-20;
datalines;
Mark Williamson 111
Joe Sanders 333
John Doe 454
;
data set2;
input name $ 1-16 num 17-20;
datalines;
Mark Williamson 787
Mark Williamson 333
John Doe 454
Mike Lee 444
Andrew Johnson 999
;
And the request to compare SET1 with SET2. My first thought would
be when is there a match 1) by NAME NUM, 2) by NAME alone, 3) by NUM
alone, and 4) by NAME and NUM independently (i.e. not together).
Your NAME_FLAG and NUM_FLAG when correctly coded, capture the
information in two of the four matching possibilities I would want.
Your TOTAL_FLAGS adds no new information not already contained in
your previous flags.
Here is code to show the problem and a simple direct solution.
data set1;
input name $ 1-16 num ;
datalines;
Mark Williamson 111
Joe Sanders 333
John Doe 454
Ed A 777
Ed B 666
;
data set2;
input name $ 1-16 num ;
datalines;
Mark Williamson 787
Mark Williamson 333
John Doe 454
Mike Lee 444
Andrew Johnson 999
Ed B 777
Ed A 666
;
proc sort data = set2 ;
by name num ;
run ;
proc sql ;
create table temp as
select name
, num
, name in (select name from set2) as name_flag
, num in (select num from set2) as num_flag
from set1
order by name, num ;
quit ;
data wanted ;
merge temp (in = temp) set2 (in = both) ;
by name num ;
if temp ;
match = both ;
run ;
title "WANTED" ;
proc print data = wanted ;
run ;
Ian Whitlock
================
Date: Mon, 17 Aug 2009 11:44:12 -0700
From: Ed A <behrens900@GMAIL.COM>
Subject: Merge with duplicates
Hello,
I am trying to do the following.
-Compare two data sets and flag every entry that occurs in BOTH
data sets. However, I only want to output the rows in the first data
set -- so if something is in the second data set and not in the
first,
I don't want it.
data set1;
input name $ 1-16 num 17-20;
datalines;
Mark Williamson 111
Joe Sanders 333
John Doe 454
;
data set2;
input name $ 1-16 num 17-20;
datalines;
Mark Williamson 787
Mark Williamson 333
John Doe 454
Mike Lee 444
Andrew Johnson 999
;
proc sort data = set1;
by name;
run;
proc sort data = set2;
by name;
run;
Data Matches;
merge set1(in=in_set1) set2(in=in_set2);
by name;
if in_set2;
if in_set1 = in_set2 then name_flag = 1;
else name_flag = 0;
if name = "" then name_flag = 0;
run;
proc sort data = set1;
by num;
run;
proc sort data = Matches;
by num;
run;
Data Matches2;
merge set1(in=in_set1) matches(in=in_matches);
by num;
if in_matches;
if in_set1 = in_matches then num_flag = 1;
else num_flag = 0;
if num=. then num_flag = 0;
total_flags = num_flag + name_flag;
run;
proc print data = matches2;
run;
Ideal Output would be:
Obs name num name_flag num_flag total_ flags
1 Mark Williamson 111 1 0 1
2 Joe Sanders 333 0 1 1
3 John Doe 454 1 1 2
Actual Output is this:
-I don't want any output from data set 2 that is not in data set 1.
Obs name num name_flag
num_flag total_ flags
1 Mark Williamson 333
1 1 2
2 Mike Lee 444
0 0 0
3 John Doe 454
1 1 2
4 Mark Williamson 787
1 0 1
5 Andrew Johnson 999
0 0 0
I tried nodups and nodupkeys but those did not work.