Date: Sun, 12 May 1996 19:59:58 GMT
Reply-To: Tadd Clayton <t.clayton@AUCKLAND.AC.NZ>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Tadd Clayton <t.clayton@AUCKLAND.AC.NZ>
Organization: University of Auckland Medical School
Subject: Summary of responses: Output obs with duplicate ID values
Hi everyone
Thanks to everyone who posted a response to my problem with
outputting observations with duplicate ID variables. I should
have been been able to figure it out myself - in my defence it
was Friday afternoon NZ time.
The original problem was:
>I have a data set in which each observation should be uniquely
>identified by a combination of two variables - a school number
>and a serial number within each school. However, due to
problems
>with the data entry process, there are a number of observations
>with duplicate values for the combination of school and serial
>numbers.
>
>I would like to be able to compare the school and serial numbers
>for each observation with those from the previous observation
and
>then output *both* observations to a data set if they are
>duplicated. I can use a retain statement to define variables
>that will carry the school and serial numbers over iterations of
>the data step to allow the comparison but, as SAS appears to
work
>on an observation by observation basis only, I can't figure out
>how to output both observations. Can anyone offer a simple
>solution?
Most people suggested using the first.variable and last.variable
automatic variables. A good example is this code from Jay
Weedon:
Here's a simple solution that unconditionally outputs the FIRST
record
for each school-serial no. combination to a scratch file temp,
but
outputs subsequent records for that observation ONLY if all the
relevant variables match (I've left the <match criterion> bit to
you).
The second data step simply deletes all the observations in the
scratch file that have a single record associated with them:
data temp;
set old;
by school sn;
if first.sn then output;
else do;
<compute match criterion for subsequent records>
if <match criterion true> then output;
end;
data new;
set temp;
by school sn;
if first.sn and last.sn then delete;
run;
Notes:
1. I assume the dataset to be sorted or indexed by school-serial
no.
2. If there are more than 2 matching records for an observation,
ALL
of them will be output.
Jay Weedon.
Another approach suggested by Adam Hendricks was to use Proc SQL:
proc sql;
create table dups as
select school, serial, count(*) as count
from data.schools
group by 1,2
having count(*) > 1;
proc print;
run;
Thanks again for the help.
Tadd
--
Tadd Clayton Ph: 64 9 373 7599 ext. 6451
Research Officer Fax: 64 9 373 7486
Department of Paediatrics Email: t.clayton@auckland.ac.nz
School of Medicine
University of Auckland
Private Bag 92019
Auckland
NEW ZEALAND