Date: Wed, 20 Jun 2007 07:12:13 -0400
Reply-To: "data _null_;" <datanull@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "data _null_;" <datanull@GMAIL.COM>
Subject: Re: two-variable deduplication problem
In-Reply-To: <1182337181.383043.234850@q69g2000hsb.googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
I don't understand the criteria. You say the desired output
ID1 ID2
1 a
2 b
From
ID1 ID2
1 a
1 b
2 b
2 a
But why not
ID1 ID2
1 b
2 a
How do you decide?
On 6/20/07, Paul <paul.vonhippel@chase.com> wrote:
> P.S. I should say that my data set contains just a few thousand
> records, so a dynamic list of unique ID1 and ID2 variables should fit
> comfortably in memory. I just don't know how to do it.
>
> On Jun 20, 3:46 am, Paul <paul.vonhip...@chase.com> wrote:
> > Thanks, David, but the usual solution doesn't work here. Remember that
> > I want to delete an observation if the value of ID1 *or* ID2 has
> > appeared before. The PROC SORT trick --
> >
> > proc sort data=adataset nodupkey;
> > by id1 id2;
> > run;
> >
> > -- deletes an observation if the value-pair (ID1, ID2) has appeared
> > before. Consider this simplified example:
> >
> > ID1 ID2
> > 1 a
> > 1 b
> > 2 b
> > 2 a
> >
> > PROC SORT NODUPKEY does not remove any cases, and neither does
> > deleting cases with FIRST.ID1 and FIRST.ID2. The desired result would
> > be this:
> >
> > ID1 ID2
> > 1 a
> > 2 b
> >
> > I've considered running PROC SORT NODUPKEY twice, first sorting by ID1
> > and then sorting by ID2. But that deletes all but one case, yielding
> > this:
> >
> > ID1 ID2
> > 1 a
> >
> > So I'm afraid need a different approach. I've never used hash tables,
> > so example code would be helpful. Or maybe someone has a simpler
> > approach that doesn't involve hashing?
> >
> > Note that the original dataset is not and should not be sorted on
> > (ID1,ID2). Instead, assume that it's sorted by some other variable
> > that indicates the desirability of keeping each case.
> >
> > Thanks again,
> > Paul
> >
> > On Jun 19, 8:47 pm, David <davids...@gmail.com> wrote:
> >
> > > The usual approach is
> >
> > > proc sort data=adataset nodupkey;
> > > by id1 id2;
> > > run;
> >
> > > You could do this, as you suggest, by maintaining a hash table with a
> > > list of ID1 and ID2 pairs, and checking this table for each
> > > observation in the dataset, but it's harder to code.
> >
> > > However blindling deleteting one of the observations might not be as
> > > good as finding some business rules to decide which observation should
> > > be kept: May I suggest you do something like:
> >
> > > proc sort data=adataset;
> > > by id1 id2;
> > > run;
> >
> > > data adataset;
> > > set adataset;
> > > by id1 id2;
> > > if not (first.id1 and first.id2) then do;
> > > if /* insert logic here */ then delete;
> > > end;
> > > run;
> >
> > > On Jun 20, 3:12 pm, Paul <paul.vonhip...@chase.com> wrote:
> >
> > > > I have a data set with two ID variables ID1 and ID2. I would like to
> > > > step through the data set from top to bottom, removing observations if
> > > > I have already seen their value of ID1 or ID2. What I need, I guess,
> > > > is some way of appending ID values into a growing list that I can
> > > > refer to as I step through the data set.
> >
> > > > Suggestions welcome....
> >
> > > > Thanks!
> > > > Paul
>
|