Date: Mon, 19 Jun 2006 16:08:19 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Reading unique records
On Sun, 18 Jun 2006 23:18:52 -0400, Arthur Tabachneck <art297@NETSCAPE.NET>
wrote:
>Tenny,
>
>Then why not create a positional variable and simply do a second sort on
>it? For example:
>
>data want;
> set have;
> position=_n_;
>run;
>
>proc sort data=want nodupkey;
> by FIRSTNAME
> MIDDLENAME
> LASTNAME
> DOB
> BENSTARTDT
> BENENDDT;
>run;
>
>proc sort data=want;
> by position;
>run;
>
>Art
Since the original order is to be maintained, it probably matters which one
of a set of identical records is to be preserved; presumably it would be the
first one encountered. As I understand it, NODUPKEY cannot assure that.
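
One way to make "keep the first one encountered" explicit is to put the positional variable at the end of the sort key and then select on FIRST. processing. This is only a sketch, assuming the flat file has already been read into HAVE; the data set names NUMBERED and FIRSTS are made up for illustration.

```sas
/* Tag every record with its original position. */
data numbered;
  set have;
  position = _n_;
run;

/* Within each group of identical keys, the earliest record sorts first. */
proc sort data=numbered;
  by FIRSTNAME MIDDLENAME LASTNAME DOB BENSTARTDT BENENDDT position;
run;

/* Keep only the first record of each group of identical keys. */
data firsts;
  set numbered;
  by FIRSTNAME MIDDLENAME LASTNAME DOB BENSTARTDT BENENDDT;
  if first.BENENDDT;
run;

/* Restore the original order and drop the helper variable. */
proc sort data=firsts out=want(drop=position);
  by position;
run;
```

Missing values still sort to the top in the intermediate data sets, but the final sort by POSITION puts the surviving records back in the order of the flat file.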
>---------
>On Mon, 19 Jun 2006 04:09:10 +0100, tenny kurian <tennykurian@YAHOO.CO.IN>
>wrote:
>
>>Hi Art,
>>
>> Thank you, Arthur.
>>
>> I tried that, but if there is a missing value in any of the sort
>> variables, it goes to the top. IT SHOULD NOT HAPPEN. I don't want to change
>> the order of the records in the external flat file.
>>
>> I even tried the NODUPRECS option. In that case too, duplicate records are
>> removed after sorting. Once sorting takes place, the records get
>> rearranged, WHICH I DON'T WANT.
>>
>> Thank You
>> Tenny
>>Arthur Tabachneck <art297@NETSCAPE.NET> wrote:
>> Tenny,
>>
>>Why not just sort the file, using the nodupkey option, and all variables
>>representing the by condition? For example,
>>
>>proc sort data=have out=want nodupkey;
>>by FIRSTNAME
>>MIDDLENAME
>>LASTNAME
>>DOB
>>BENSTARTDT
>>BENENDDT;
>>run;
>>
>>Art
>>-----------
>>On Mon, 19 Jun 2006 03:51:25 +0100, tenny kurian
>>wrote:
>>
>>>Hi Kevin,
>>>
>>> Thank you for your response,
>>>
>>> All the fields in a record must match for it to count as a duplicate.
>>> It is not just 2 or 3 common fields. If all the information contained in
>>> the records is the same, then I need to eliminate all the duplicates and
>>> keep one record.
>>>
>>> Say for eg. A record contains following info.
>>> FIRSTNAME
>>> MIDDLENAME
>>> LASTNAME
>>> DOB
>>> BENSTARTDT
>>> BENENDDT
>>>
>>>
>>> If all the values contained in the above variables are repeated, then
>>> delete the duplicates.
>>>
>>> I am using SAS v8 on Unix.
>>>
>>> Thank You,
>>> Tenny.
>>>Kevin Roland Viel wrote:
>>> On Sun, 18 Jun 2006, tenny kurian wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to get help with the following problem.
>>>>
>>>> I am getting input records from a flat file.
>>>>
>>>>
>>>> Each line in the external flat file corresponds to one record.
>>>> I am reading the external flat file using an INFILE statement and
>>>> column pointers.
>>>> LRECL is 300.
>>>> There are some duplicate records in the flat file. They need not be
>>>> in sequence.
>>>> I want to read only unique records. That means that if there is a replica
>>>> of a record, I want to read only the first occurrence of that record.
>>>>
>>>> It would be really helpful if someone can help in resolving this issue.
>>>
>>>Tenny,
>>>
>>>You are not quite clear. You need to read a record to determine whether
>>>it is unique. Is this a problem of a large flat file that you are trying
>>>to process more efficiently by reading only part of each record and skipping
>>>the rest if another with the same identifier has already been read?
>>>
>>>Also, you should state which version of SAS you have and what
>>>combination of fields make the record unique.
>>>
>>>I have assumed that only part of the record determines its ID and
>>>whether it is unique. I have also taken advantage of the HASH object
>>>available in v9:
>>>
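>>>/* Write a test file in which every ID (x) appears twice. */
>>>data _null_ ;
>>>  file "C:\unique.txt" ;
>>>  do x = 1 to 10 ;
>>>    do y = 1 to 2 ;
>>>      put x y ;
>>>    end ;
>>>  end ;
>>>run ;
>>>
>>>data unique ( keep = ID y ) ;
>>>
>>>  /* Create the hash of IDs already seen, once, on the first iteration. */
>>>  if _n_ = 1 then
>>>  do ;
>>>    dcl hash unique() ;
>>>    unique.Definekey ( "ID" ) ;
>>>    unique.Definedone ( ) ;
>>>  end ;
>>>
>>>  infile "C:\unique.txt" ;
>>>  input ID y ;
>>>
>>>  /* CHECK returns 0 when the current ID is already in the hash. */
>>>  __rc = unique.CHECK() ;
>>>
>>>  /* First occurrence: write the record and remember its ID. */
>>>  if __rc ne 0 then
>>>  do ;
>>>    output ;
>>>    __rc = unique.ADD() ;
>>>  end ;
>>>
>>>run ;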
>>>
>>>proc print data = unique ;
>>>run ;
>>>
>>>You could use the entire line as the ID, but you could start running into
>>>RAM limitations and start paging. I personally am very hesitant to not
>>>make a dataset from the original file and subset that. At the very least,
>>>I would count the number of duplicates I have and write that to the log,
>>>which I keep.
>>>
>>>This method is rather robust to the nature of the ID. If you do not have
>>>v9, then you can accomplish the same thing using an array, but the index
>>>can only be a number (in sharp contrast to the key of the hash).
>>>
>>>HTH,
>>>
>>>Kevin
>>>
>>>Kevin Viel
>>>Department of Epidemiology
>>>Rollins School of Public Health
>>>Emory University
>>>Atlanta, GA 30322