Date: Fri, 14 Jan 2011 18:17:12 -0500
Reply-To: Arthur Tabachneck <art297@ROGERS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@ROGERS.COM>
Subject: Re: Find Records in a Cluster by Date
Content-Type: text/plain; charset=ISO-8859-1
James,
The following probably a very inefficient way to do what you want but,
before trying to streamline it (if that is even necessary), does the
following do what you need? It assumes no overlap:
proc sort data=have;
by Patient Type Date;
run;
proc transpose data=have out=wide;
by Patient Type;
var Date;
run;
data temp (keep=Patient Type Date ClusterID);
set wide;
format Date mmddyy10.;
array Dates(*) COL:;
array ClusterIDs(50);
ClusterIDNumber=0;
do i=1 to 50;
if i gt dim(Dates) then i=50;
else if missing(Dates(i)) then i=50;
else do;
if i eq 1 then do;
ClusterIDNumber+1;
ClusterIDs(i)=ClusterIDNumber;
end;
else do;
if Dates(i)-Dates(i-1) gt 23 then do;
ClusterIDNumber+1;
ClusterIDs(i)=ClusterIDNumber;
end;
else ClusterIDs(i)=ClusterIDNumber;
end;
Date=Dates(i);
ClusterID=ClusterIDs(i);
output;
end;
end;
run;
proc sql noprint;
create table want as
select *, count(*) as ClusterSize
from temp
group by patient,type,clusterID
;
quit;
HTH,
Art
--------
On Fri, 14 Jan 2011 17:07:50 -0500, James, Steve (CDC/OID/OD) <spj1@CDC.GOV>
wrote:
>Thanks,
>
>We anticipate up to several million patients with a top end of 20-30
million vaccination records. There could be 40-50 records per patient with
as many as 14 different vaccine types, although we anticipate to average
about 10 records per patients.
>
>Steve
>
>-----Original Message-----
>From: Arthur Tabachneck [mailto:art297@ROGERS.COM]
>Sent: Friday, January 14, 2011 4:58 PM
>To: SAS-L@LISTSERV.UGA.EDU; James, Steve (CDC/OID/OD)
>Subject: Re: Find Records in a Cluster by Date
>
>Steve,
>
>I haven't kept up with this thread, thus might be asking a couple of
>questions that you've already answered.
>
>One, how many patients will be in the data?
>
>And, two, approximately what is the maximum number of vaccine records that
>might exist for a patient?
>
>Art
>--------
>On Fri, 14 Jan 2011 16:52:47 -0500, Steve James <spj1@CDC.GOV> wrote:
>
>>SAS-L
>>
>>Barry Swartz pointed out some discrepancies in my previous post, so let me
>>modify the test data a little and clarify what I want.
>>
>>We’re looking for possible duplicate vaccine records and the business
rule
>>is that we look for records within a date of 23 days of each other. We
>>define the clusters and then do the deduplication process within a
>>cluster. It’s theoretically possible a record could be in more than one
>>cluster, but I don’t know if I have any at this point.
>>
>>Also a patient can have multiple vaccine types and each type can have
>>multiple clusters.
>>
>>Hopefully that clarifies things.
>>
>>I’ve modified the data so that there are no overlapping clusters.
>>
>>Have:
>>
>>Patient Type Date
>>1 1 01/01/10
>>1 1 01/04/10
>>1 1 01/15/10
>>1 1 04/28/10
>>1 1 05/04/10
>>2 1 06/14/10
>>
>>Want
>>
>>Patient Type Date ClusterID ClusterSize
>>1 1 01/01/10 1 3
>>1 1 01/04/10 1 3
>>1 1 01/15/10 1 3
>>1 1 04/28/10 2 2
>>1 1 05/04/10 2 2
>>2 1 06/14/10 1 1
>>
>>
>>Thanks for your help,
>>
>>Steve
>>
>>_____________________________________________
>>From: James, Steve (CDC/OID/OD)
>>Sent: Friday, January 14, 2011 2:49 PM
>>To: SAS-L@LISTSERV.UGA.EDU
>>Subject: Find Records in a Cluster by Date
>>
>>
>>Dear SAS-L,
>>
>>I’m having a hard time getting my head around this one, and have had
very
>>little luck searching the SAS-L archives. I’m sure someone out there
can
>>whip out something in a second. And if not, that justifies my struggling
>>with it.
>>
>>I’ve got vaccination records ordered by patient and vaccine type and I
>>want to define a cluster for each record that is within 23 days of every
>>other record for the same patient and vaccine type. I would like some
>>sort of identifier that identifies each cluster and the size of the
>>cluster. I’m anticipating clusters of from 1-4 records but there
>>potentially could be larger ones.
>>
>>I figure that some records may be in multiple clusters, e.g. 1/1/10,
>>1/15/10, and 1/31/10 – not sure exactly what I want to do with that but
>>it’s something that could happen and if I can at least identify them
>>that’d be a plus.
>>
>>Below is a rough example of what I’m hoping to accomplish.
>>
>>Have:
>>
>>Patient Type Date
>>1 1 01/01/10
>>1 1 01/04/10
>>1 1 01/15/10
>>1 1 01/28/10
>>1 1 02/04/10
>>2 1 02/14/10
>>
>>Want
>>
>>Patient Type Date ClusterID ClusterSize
>>1 1 01/01/10 1 3
>>1 1 01/04/10 1 3
>>1 1 01/15/10 1 3
>>1 1 01/28/10 2 2
>>1 1 02/04/10 2 2
>>2 1 02/14/10 1 1
>>
>>
>>Thanks for your help,
>>
>>Steve
|