Date: Sat, 11 Aug 2007 02:08:25 -0400
Reply-To: "Richard A. DeVenezia" <rdevenezia@WILDBLUE.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Richard A. DeVenezia" <rdevenezia@WILDBLUE.NET>
Organization: Internet News Service
Subject: Re: Finding parent-child relationship within ID variable
Hari wrote:
> On Jul 13, 6:09 pm, Hari <excel_h...@yahoo.com> wrote:
>> On Jul 12, 8:53 pm, Nord...@DSHS.WA.GOV ("Nordlund, Dan (DSHS/RDA)")
>> wrote:
>>
>>
>>
>>
>>
>>>> -----Original Message-----
>>>> From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On
>>>> Behalf Of Hari
>>>> Sent: Thursday, July 12, 2007 3:54 AM
>>>> To: SA...@LISTSERV.UGA.EDU
>>>> Subject: Re: Finding parent-child relationship within ID variable
>>
>>>> On Jul 12, 3:38 am, Nord...@DSHS.WA.GOV ("Nordlund, Dan
>>>> (DSHS/RDA)") wrote:
>>>>>> -----Original Message-----
>>>>>> From: SAS(r) Discussion [mailto:S...@LISTSERV.UGA.EDU] On
>>>>>> Behalf Of Hari
>>>>>> Sent: Wednesday, July 11, 2007 2:39 PM
>>>>>> To: S...@LISTSERV.UGA.EDU
>>>>>> Subject: Finding parent-child relationship within ID variable
>>
>>>>>> Hi,
>>
>>>>>> I have a SAS data set A with 2 fields. Both are ID fields
>>>> and all the
>>>>>> digits in the ID are "numbers" and stored in $16 format.
>>
>>>>>> The data set A was originally created by using a join of
>>>> a data set B
>>>>>> with itself (B*B) and only those records which met
>>>> certain conditions
>>>>>> where only retained in the output. The original data set B had 80
>>>>>> million obs and right now the final output data set A has
>>>> close to 90
>>>>>> Million observations.
>>
>>>>>> By doing the join operation Im trying to see as to which ID's are
>>>>>> related to each other. (basically, the data B is customer
>>>> level data
>>>>>> which was house-holded using address logic. Presently I want to
>>>>>> do additonal house-holding using email/name etc...hence the
>>>> join was done
>>>>>> to see which customers have same email/name but diff
>>>> house-hold level
>>>>>> ID's)
>>
>>>>>> *21 sample records (for simplicity sake im putting the length of
>>>>>> fields as just 2);
>>
>>>>>> ID_1 ID_2
>>>>>> 11 67
>>>>>> 11 55
>>>>>> 11 98
>>>>>> 67 55
>>>>>> 55 67
>>>>>> 98 11
>>>>>> 98 55
>>>>>> 55 98
>>>>>> 67 98
>>>>>> 55 11
>>>>>> 98 67
>>>>>> 67 11
>>>>>> 23 45
>>>>>> 92 35
>>>>>> 45 23
>>>>>> 92 29
>>>>>> 29 35
>>>>>> 35 92
>>>>>> 29 92
>>>>>> 35 29
>>>>>> 29 35
>>
>>>>>> Now, for me distinction between ID_1 and ID_2 is
>>>> meaningless. As long
>>>>>> as ID 11 is related to ID 67, it means the same as another record
>>>>>> which says ID 67 is related to ID 11. Hence, I want to reduce the
>>>>>> above data set to the below format, such that only unique
>>>> information
>>>>>> is retained
>>
>>>>>> ID_modified1 ID_modified_2
>>>>>> 11 67
>>>>>> 11 55
>>>>>> 11 98
>>>>>> 23 45
>>>>>> 29 35
>>>>>> 29 92
>>
>>>>>> (In above output am retaining the minimum ID in left and the rest
>>>>>> higher ID's in right, but I can keep it anyway as long as
>>>> information
>>>>>> isnt duplicated)
>>
>>>>>> Whats an efficient way to arrive at this output? (I hope
>>>> to learn some
>>>>>> nice techniques!)
>>
>>>>>> Btw, I do know one method which is to Transpose data set A by
>>>>>> ID_1 (and do Var ID_2) and after transposing rearranging the
>>>> ID values in
>>>>>> each record individually such that minimum ID is in left
>>>> most variable
>>>>>> and maximum in right most and so on.Finally, do a SQL
>>>> distinct on the
>>>>>> rearranged data set and this distinct data set can again
>>>> be transposed
>>>>>> by the left most ID variable to get the final output.
>>
>>>>>> The steps that I dont really like in my method is
>>>> rearranging because:
>>>>>> -
>>>>>> a) I really dont know in advance as to how many new transposed
>>>>>> variables will be created. It can be even as big as 200
>>>> or even 400! I
>>>>>> can probably do a proc contents and store the number of resulting
>>>>>> fields in a macro variable
>>
>>>>>> b) More bigger problem is rearrangng itself. Since, num of fields
>>>>>> might even be 400 or so, I will usually create that many
>>>> dummy data
>>>>>> set variables, but again I dont know an efficient way to sort all
>>>>>> variables within each record individually.
>>
>>>>>> Please guide me.
>>
>>>>>> regards,
>>>>>> HP
>>>>>> India
>>
>>>>> Hari,
>>
>>>>> You could try something like:
>>
>>>>> data temp / view = temp;
>>>>> set have;
>>>>> if id_1 LT ID_2 then do;
>>>>> t=id_1;
>>>>> id_1=id_2;
>>>>> id_2=t;
>>>>> end;
>>>>> run;
>>>>> proc sql ;
>>>>> create table want as
>>>>> select distinct id_1, id_2
>>>>> from temp
>>>>> order by id_1
>>>>> ;
>>>>> quit;
>>
>>>>> Hope this is helpful,
>>
>>>>> Dan
>>
>>>>> Daniel J. Nordlund
>>>>> Research and Data Analysis
>>>>> Washington State Department of Social and Health Services
>>>>> Olympia, WA 98504-5204- Hide quoted text -
>>
>>>>> - Show quoted text -
>>
>>>> Daniel,
>>
>>>> That wouldnt work in the present case, because after the first data
>>>> (view) step, the temp would like (I have dropped the variable t as
>>>> it is unneccesary)
>>
>>>> ID_1 ID_2
>>>> 11 67
>>>> 11 55
>>>> 11 98
>>>> 55 67
>>>> 55 67
>>>> 11 98
>>>> 55 98
>>>> 55 98
>>>> 67 98
>>>> 11 55
>>>> 67 98
>>>> 11 67
>>>> 23 45
>>>> 35 92
>>>> 23 45
>>>> 29 92
>>>> 29 35
>>>> 35 92
>>>> 29 92
>>>> 29 35
>>>> 29 35
>>
>>>> Now when we do distinct using the "Want" data set result would like
>>
>>>> ID_1 ID_2
>>>> 11 67
>>>> 11 55
>>>> 11 98
>>>> 23 45
>>>> 29 35
>>>> 29 92
>>>> 35 92
>>>> 55 67
>>>> 55 98
>>>> 67 98
>>
>>>> Which is not equivalent to the result Im looking for. if you notice
>>>> above there is a "information" duplication above wherein 11 is
>>>> mapped to 67 and 55 and in a later record we have 55 mapped to 67,
>>>> which is unnecessary for my case.
>>
>>>> hp
>>
>>> Hari,
>>
>>> If I understand correctly (and we already know that is not a given
>>> :-), you can just do an extra SQL query to eliminate rows where
>>> id_1 shows up in id_2.
>>
>>> data temp / view = temp;
>>> set have;
>>> if id_2 LT ID_1 then do;
>>> t=id_1;
>>> id_1=id_2;
>>> id_2=t;
>>> end;
>>> drop t;
>>> run;
>>> proc sql ;
>>> create table temp2 as
>>> select distinct id_1, id_2
>>> from temp
>>> order by id_1
>>> ;
>>> create table want as
>>> select a.*
>>> from temp2 as a
>>> where a.id_1 not in(select id_2 from temp2)
>>> ;
>>> quit;
>>
>>> Hope this is helpful,
>>
>>> Dan
>>
>>> Daniel J. Nordlund
>>> Research and Data Analysis
>>> Washington State Department of Social and Health Services
>>> Olympia, WA 98504-5204- Hide quoted text -
>>
>>> - Show quoted text -
>>
>> Daniel,
>>
>> Thanks a TON. Your last step got me what I was looking for.
>>
>> Barry - Thanks to you as well. I learned yet another method.
>>
>> hp- Hide quoted text -
>>
>> - Show quoted text -
>
> Hi,
>
> I had posted this problem quite some time back and I realized today
> that my understanding of the input data structure is incomplete.
> Originally I assumed a data relationship in a combinatorial format
> like the one originally stated.
>
> But it seems, I have a slightly different structure as given below,
> wherein not all possible relationships are directly represented in the
> data
>
> 1 2
> 1 7
> 2 3
> 2 4
> 4 5
> 3 6
> 2 1
> 7 1
> 3 2
> 4 2
> 5 4
> 6 3
>
> Now, in above even though 1 is related to just 2 and 7, in reality 1
> is related to all the rest. That because 2 is related to 3 and 4. 4 is
> related to 5 and finally 3 is related to 6.
>
> In short, I want the output from above to look like
>
> 1 2
> 1 3
> 1 4
> 1 5
> 1 6
> 1 7
>
> Can you please guide me
Take a look at
http://www.devenezia.com/downloads/sas/samples, superset.
--
Richard A. DeVenezia
|