Date: Fri, 26 Mar 2010 13:56:52 -0400
Reply-To: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Subject: Re: PROC COMPARE checking unique records irrespective of position
Content-Type: text/plain; charset="us-ascii"
Many options are available to you.
1. Mark T has already presented an SQL approach.
2. Sorting datasets and then merging (or interleaving them) to
find the number of records for each value of the variables found
in each datasets. Would take 3 steps.
3. Use hash tables. This is a single data step. Read dataset1
into one ordered hash table keyed on your variable, read in
dataset2 into another. Then traverse with a hash iterator noting
which keys are in both, in hash 1 only, or hash 2 only. A bit
more complicated programming, but it is a single step.
4. But I think this is one of those tasks which initially seems
most suitable to SQL or DATA step programming but might be done
more efficaciously via one of the many SAS PROCs. Here's your
problem submitted to PROC SUMMARY (I still call it by its archaic
alias PROC MEANS). It's two steps, as below:
** Step 1: Make a data view, to be run through PROC MEANS. **;
data viewboth / view=viewboth;
set data1 (in=in1) data2 (in=in2);
** Step 2: This proc means will give, for each value of VAR,
the sum of the dummy variables PRESENT1 and PRESENT2, i.e.
the number of records in each dataset **;
proc means data=viewboth noprint;
var present1 present2;
output out=want (drop=_TYPE_ _FREQ_) sum=NRECS1 NRECS2;
For large datasets I suspect that this might be faster than
those SQL suggestions which need to actually match records in
the two datasts. PROC means, on the other hand will just run
from top to bottom of the concatenated datasets, keeping running
tabs on record counts in each dataset per class.
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Sent: Friday, March 26, 2010 12:46 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: PROC COMPARE checking unique records irrespective of position
> If I have two datasets, each having one variable and all observations
> being unique. only that one has more observations than the other.
> one two
> var1 var2
> ABC ABC
> AAA AAA
> CCC CCC
> ACX XYZ
> XYZ ZZZ
> Because of this, if I use PROC COMPARE, the output dataset shows first
> 3 values matching. But even though #5 in one is equal to #4 in two
> (similarly, #6 in one = #5 in two), the output dataset, from #4 and
> beyond shows that these are not matching. Is there any way PROC
> COMPARE can compare the values of a variable irrespective of position?
> It would be helpful in this case since all observations are unique.
> Is there anyway using PROC COMPARE I can get matching observations