LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2007, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 20 Jun 2007 07:12:13 -0400
Reply-To:     "data _null_;" <datanull@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "data _null_;" <datanull@GMAIL.COM>
Subject:      Re: two-variable deduplication problem
Comments: To: Paul <paul.vonhippel@chase.com>
In-Reply-To:  <1182337181.383043.234850@q69g2000hsb.googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I don't understand the criteria. You say the desired output

ID1 ID2 1 a 2 b

From ID1 ID2 1 a 1 b 2 b 2 a

But why not ID1 ID2 1 b 2 a How do you decide?

On 6/20/07, Paul <paul.vonhippel@chase.com> wrote: > P.S. I should say that my data set contains just a few thousand > records, so a dynamic list of unique ID1 and ID2 variables should fit > comfortably in memory. I just don't know how to do it. > > On Jun 20, 3:46 am, Paul <paul.vonhip...@chase.com> wrote: > > Thanks, David, but the usual solution doesn't work here. Remember that > > I want to delete an observation if the value of ID1 *or* ID2 has > > appeared before. The PROC SORT trick -- > > > > proc sort data=adataset nodupkey; > > by id1 id2; > > run; > > > > -- deletes an observation if the value-pair (ID1, ID2) has appeared > > before. Consider this simplified example: > > > > ID1 ID2 > > 1 a > > 1 b > > 2 b > > 2 a > > > > PROC SORT NODUPKEY does not remove any cases, and neither does > > deleting cases with FIRST.ID1 and FIRST.ID2. The desired result would > > be this: > > > > ID1 ID2 > > 1 a > > 2 b > > > > I've considered running PROC SORT NODUPKEY twice, first sorting by ID1 > > and then sorting by ID2. But that deletes all but one case, yielding > > this: > > > > ID1 ID2 > > 1 a > > > > So I'm afraid need a different approach. I've never used hash tables, > > so example code would be helpful. Or maybe someone has a simpler > > approach that doesn't involve hashing? > > > > Note that the original dataset is not and should not be sorted on > > (ID1,ID2). Instead, assume that it's sorted by some other variable > > that indicates the desirability of keeping each case. > > > > Thanks again, > > Paul > > > > On Jun 19, 8:47 pm, David <davids...@gmail.com> wrote: > > > > > The usual approach is > > > > > proc sort data=adataset nodupkey; > > > by id1 id2; > > > run; > > > > > You could do this, as you suggest, by maintaining a hash table with a > > > list of ID1 and ID2 pairs, and checking this table for each > > > observation in the dataset, but it's harder to code. > > > > > However blindling deleteting one of the observations might not be as > > > good as finding some business rules to decide which observation should > > > be kept: May I suggest you do something like: > > > > > proc sort data=adataset; > > > by id1 id2; > > > run; > > > > > data adataset; > > > set adataset; > > > by id1 id2; > > > if not (first.id1 and first.id2) then do; > > > if /* insert logic here */ then delete; > > > end; > > > run; > > > > > On Jun 20, 3:12 pm, Paul <paul.vonhip...@chase.com> wrote: > > > > > > I have a data set with two ID variables ID1 and ID2. I would like to > > > > step through the data set from top to bottom, removing observations if > > > > I have already seen their value of ID1 or ID2. What I need, I guess, > > > > is some way of appending ID values into a growing list that I can > > > > refer to as I step through the data set. > > > > > > Suggestions welcome.... > > > > > > Thanks! > > > > Paul >


Back to: Top of message | Previous page | Main SAS-L page