Date: Fri, 10 Aug 2007 10:42:00 -0700
Reply-To: "Pardee, Roy" <pardee.r@GHC.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Pardee, Roy" <pardee.r@GHC.ORG>
Subject: Re: Normalizing relationships in data set
In-Reply-To: A<200708100927.l7A0J9FF028696@mailgw.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"
Here is a brute-force iterative approach for a similar problem I
had--trying to construct 'families' from pairwise member-subscriber
relationships (members are people covered by the plan, subscribers are
the people responsible for the coverage).
This may not be performant enough for you, and in any event, you'll need
to decide how many 'degrees' of relatedness you want to chase down
before stopping.
FWIW--I believe oracle (and probably other dbs by now) have their own
SQL extensions to do this sort of recursive join. If you can get your
data into oracle, you may be able to get a lot of mileage out of its
CONNECT BY PRIOR syntax...
HTH,
-Roy
* ==================================================== ;
data people ;
length chsid subchsid$ 15 ;
input chsid $ subchsid $ ;
* Roy & spouse cover each other. Roy covers RoyDaughter, RoyJunior and
;
* RoySon, but spouse only covers RoyDaughter and RoyJunior. ;
* Can we reliably put RoySon and RoySpouse in the same family? ;
* Gene is all by his onesey. ;
datalines ;
Roy Roy
RoySpouse Roy
RoySon Roy
RoyDaughter Roy
RoyJunior Roy
RoySpouse RoySpouse
RoyDaughter RoySpouse
RoyJunior RoySpouse
Tyler Tyler
Tyler TylerSpouse
TylerSpouse Tyler
TylerSpouse TylerSpouse
TylerDaughter Tyler
TylerDaughter TylerSpouse
Gene Gene
2Dad 2Dad
2Dad 4Mom
4Mom 2Dad
4Mom 4Mom
AKid 2Dad
AKid 4Mom
AKid 7Someone
CS CS
PF CS
TT CS
CS TT
TT TT
UP TT
56Mom 56Mom
G8StepDad 56Mom
ZNKid 56Mom
AExHusband AExHusband
DDKid AExHusband
MIKid AExHusband
VEKid AExHusband
G8StepDad G8StepDad
MIKid G8StepDad
XGKid G8StepDad
ZNKid G8StepDad
;
run ;
%macro FamilyReunion(InSet = , OutSet = , id = MRN, subid = SubCHSID,
MaxIterations = 1) ;
proc sql ;
* Make sure we are starting w/unique values of id & subid. ;
create table _input as
select distinct &id, &subid
from &InSet ;
* All pairs who share a subscriber. ;
create table linked_people as
select distinct p1.&id as first_person, p2.&id as second_person
from _input as p1 INNER JOIN
_input as p2
on p1.&id = p2.&subid
where p1.&id ne p2.&id
order by p1.&id, p2.&id
;
quit ;
* Make an arbitrary family identifier. ;
* This puts the same people into multiple families--the families get
merged below. ;
data families ;
set linked_people ;
by first_person ;
family_id + first.first_person ;
run ;
proc sql ;
* Stack the first_persons on top of the second_persons, ;
* to get a complete list of people and their families. ;
create table families2 as
select first_person as &id, family_id
from families
union
select second_person as &id, family_id
from families
;
drop table families ;
%let NumMultiples = 1 ;
%do %while ((&MaxIterations) > 0 AND (&NumMultiples > 0)) ;
create table multiples&MaxIterations as
select &id, min(family_id) as min_id, max(family_id) as max_id
from families2
group by &id
having min(family_id) ne max(family_id)
;
%let NumMultiples = &SQLOBS ;
* We want to take all the records in families where ;
* family_id = multiples.max_id and change it to
multiples.min_id. ;
create table new_families as
select f.&id, coalesce(min_id, f.family_id) as family_id
from families2 as f LEFT JOIN
multiples&MaxIterations as m
on f.family_id = m.max_id
;
drop table families2 ;
drop table multiples&MaxIterations ;
create table families2 as select * from new_families ;
drop table new_families ;
%let MaxIterations = %eval(&MaxIterations - 1) ;
%end ;
create table family_members as
select distinct &id, family_id
from families2
order by &id, family_id
;
quit ;
* This guarantees us a one-id-per-row output dset. ;
* If anyone remains in multiple families ;
* then the output dset will contain multiple family_ vars. ;
proc transpose data = family_members out = tpose(drop = _name_)
prefix = family_ ;
var family_id ;
by &id ;
run ;
proc sql ;
create table &OutSet as
select *
from tpose
where &id in (select &id from &InSet)
order by family_1, &id
;
drop table tpose ;
quit ;
%mend FamilyReunion ;
options mlogic nocenter ;
%FamilyReunion(InSet = people, OutSet = waka, id = chsid, subid =
subchsid, MaxIterations = 1) ;
proc print data = waka ;
run ;
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Hari Prasadh
Sent: Friday, August 10, 2007 2:27 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Normalizing relationships in data set
Hi,
I am starting a new post from "Finding parent-child relationship within
ID variable" so that I can express my requirements better.
I have a data set with a particular ID. Based on some operations I have
been able to determine that the different values of this ID are actually
related to each other in the following manner:-
ID ID_Related
1 2
1 7
2 3
2 4
3 4
4 5
3 6
3 5
2 1
7 1
3 2
4 2
5 4
6 3
5 3
4 3
Based on the above list: -
a)I know that 1 is related to 2 and 7.
b) Since 2 is related to 3 and 4, so this means that 1 is related to 2,
3,
4 and 7.
c) Continuing in this fashion, I can determine that 1 is actually
related to all numbers from 2 through 7.
d) I would like to express c) as a simpler relationship or a data layout
like the following, where a single ID is related to rest of the ID's.
ID ID_Related
1 2
1 3
1 4
1 5
1 6
1 7
I have indicated value 1 above as the parent ID but it doesnt really
matter to me. I would be happy even with the output like the following
as well:-
ID ID_Related
4 1
4 2
4 3
4 5
4 6
4 7
In the above example, we had 7 Id's forming ONE single group. In my
actual data, there are close to 50 million ID's in all and they might
collapse to 40 million groups or so.
Please guide me.
hp