LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 10 Aug 2007 10:42:00 -0700
Reply-To:     "Pardee, Roy" <pardee.r@GHC.ORG>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Pardee, Roy" <pardee.r@GHC.ORG>
Subject:      Re: Normalizing relationships in data set
In-Reply-To:  A<200708100927.l7A0J9FF028696@mailgw.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"

Here is a brute-force iterative approach for a similar problem I had--trying to construct 'families' from pairwise member-subscriber relationships (members are people covered by the plan, subscribers are the people responsible for the coverage).

This may not be performant enough for you, and in any event, you'll need to decide how many 'degrees' of relatedness you want to chase down before stopping.

FWIW--I believe oracle (and probably other dbs by now) have their own SQL extensions to do this sort of recursive join. If you can get your data into oracle, you may be able to get a lot of mileage out of its CONNECT BY PRIOR syntax...

HTH,

-Roy

* ==================================================== ;

data people ; length chsid subchsid$ 15 ; input chsid $ subchsid $ ; * Roy & spouse cover each other. Roy covers RoyDaughter, RoyJunior and ; * RoySon, but spouse only covers RoyDaughter and RoyJunior. ; * Can we reliably put RoySon and RoySpouse in the same family? ; * Gene is all by his onesey. ; datalines ; Roy Roy RoySpouse Roy RoySon Roy RoyDaughter Roy RoyJunior Roy RoySpouse RoySpouse RoyDaughter RoySpouse RoyJunior RoySpouse Tyler Tyler Tyler TylerSpouse TylerSpouse Tyler TylerSpouse TylerSpouse TylerDaughter Tyler TylerDaughter TylerSpouse Gene Gene 2Dad 2Dad 2Dad 4Mom 4Mom 2Dad 4Mom 4Mom AKid 2Dad AKid 4Mom AKid 7Someone CS CS PF CS TT CS CS TT TT TT UP TT 56Mom 56Mom G8StepDad 56Mom ZNKid 56Mom AExHusband AExHusband DDKid AExHusband MIKid AExHusband VEKid AExHusband G8StepDad G8StepDad MIKid G8StepDad XGKid G8StepDad ZNKid G8StepDad ; run ;

%macro FamilyReunion(InSet = , OutSet = , id = MRN, subid = SubCHSID, MaxIterations = 1) ; proc sql ; * Make sure we are starting w/unique values of id & subid. ; create table _input as select distinct &id, &subid from &InSet ;

* All pairs who share a subscriber. ; create table linked_people as select distinct p1.&id as first_person, p2.&id as second_person from _input as p1 INNER JOIN _input as p2 on p1.&id = p2.&subid where p1.&id ne p2.&id order by p1.&id, p2.&id ; quit ;

* Make an arbitrary family identifier. ; * This puts the same people into multiple families--the families get merged below. ; data families ; set linked_people ; by first_person ; family_id + first.first_person ; run ;

proc sql ;

* Stack the first_persons on top of the second_persons, ; * to get a complete list of people and their families. ; create table families2 as select first_person as &id, family_id from families union select second_person as &id, family_id from families ;

drop table families ;

%let NumMultiples = 1 ;

%do %while ((&MaxIterations) > 0 AND (&NumMultiples > 0)) ; create table multiples&MaxIterations as select &id, min(family_id) as min_id, max(family_id) as max_id from families2 group by &id having min(family_id) ne max(family_id) ; %let NumMultiples = &SQLOBS ;

* We want to take all the records in families where ; * family_id = multiples.max_id and change it to multiples.min_id. ; create table new_families as select f.&id, coalesce(min_id, f.family_id) as family_id from families2 as f LEFT JOIN multiples&MaxIterations as m on f.family_id = m.max_id ;

drop table families2 ; drop table multiples&MaxIterations ;

create table families2 as select * from new_families ;

drop table new_families ;

%let MaxIterations = %eval(&MaxIterations - 1) ; %end ;

create table family_members as select distinct &id, family_id from families2 order by &id, family_id ; quit ;

* This guarantees us a one-id-per-row output dset. ; * If anyone remains in multiple families ; * then the output dset will contain multiple family_ vars. ; proc transpose data = family_members out = tpose(drop = _name_) prefix = family_ ; var family_id ; by &id ; run ;

proc sql ; create table &OutSet as select * from tpose where &id in (select &id from &InSet) order by family_1, &id ;

drop table tpose ;

quit ;

%mend FamilyReunion ;

options mlogic nocenter ;

%FamilyReunion(InSet = people, OutSet = waka, id = chsid, subid = subchsid, MaxIterations = 1) ;

proc print data = waka ; run ;

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Hari Prasadh Sent: Friday, August 10, 2007 2:27 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Normalizing relationships in data set

Hi,

I am starting a new post from "Finding parent-child relationship within ID variable" so that I can express my requirements better.

I have a data set with a particular ID. Based on some operations I have been able to determine that the different values of this ID are actually related to each other in the following manner:-

ID ID_Related 1 2 1 7 2 3 2 4 3 4 4 5 3 6 3 5 2 1 7 1 3 2 4 2 5 4 6 3 5 3 4 3

Based on the above list: -

a)I know that 1 is related to 2 and 7. b) Since 2 is related to 3 and 4, so this means that 1 is related to 2, 3, 4 and 7. c) Continuing in this fashion, I can determine that 1 is actually related to all numbers from 2 through 7. d) I would like to express c) as a simpler relationship or a data layout like the following, where a single ID is related to rest of the ID's.

ID ID_Related 1 2 1 3 1 4 1 5 1 6 1 7

I have indicated value 1 above as the parent ID but it doesnt really matter to me. I would be happy even with the output like the following as well:-

ID ID_Related 4 1 4 2 4 3 4 5 4 6 4 7

In the above example, we had 7 Id's forming ONE single group. In my actual data, there are close to 50 million ID's in all and they might collapse to 40 million groups or so.

Please guide me.

hp


Back to: Top of message | Previous page | Main SAS-L page