LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 2005, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 15 Nov 2005 09:52:07 -0500
Reply-To:   "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Subject:   Re: Base vs. SQL
Comments:   To: sas-l@uga.edu

Keith Kaiser wrote: > Here is a wide open question for the list. > > Ten SAS data sets, 2 million rows each, maybe 10 columns. > > I need to extract any multiple records out of them. A record has a > column called IDNUM, it is a unique identifier for the data, and all > 10 data sets are indexed and sorted by it. The only other key > variable is a date field called PDATE, think patient and treatment > date. > > How do I extract just the duplicates rows by IDNUM?

If this is so, then IDNUM is not a unique identifier. Perhaps it is a Primary key of some principal table, and a foreign key to several other related tables/

> I need duplicates > by year, and across year, so if it's in the 1995 data and again in > 2002 I need it, but I also need it if it is duplicated in 1995 only.

> Do I set the 10 data sets together first? Do I use some complex SQL, > is their a method I don't know that makes it easy? After I get the > new data se= t > with just the dupes I can do the rest I'm just looking for a simple, > quick method of finding all dupes and keeping them.

Sounds like 'dupedness' can occur across tables. i.e. IDNUM=1 in table1 and IDNUM=1 in table2 infers a duplicate to be acted upon.

Are the tables segmented in any designed way? i.e. Each table is for PDATEs of a given year.

Regardless, this sample stacks the data using a view and queries out the duplicated ids while maintaining each instance contributing to the duplication state. Stacking large numbers of tables tends towards wallpaper code, which in turn tends towards macro. Avoid macro unless the number of tables (or names thereof) being stacked changes with each required 'run through' the data.

data _1 _2 _3 _4 _5;

do id = 1 to 1000; if ranuni(1) < 0.10 then output _1; if ranuni(1) < 0.10 then output _2; if ranuni(1) < 0.10 then output _3; if ranuni(1) < 0.10 then output _4; if ranuni(1) < 0.10 then output _5; end;

run;

data stacked / view=stacked; source='_1'; do until (_1); set _1 end=_1; output; end; source='_2'; do until (_2); set _2 end=_2; output; end; source='_3'; do until (_3); set _3 end=_3; output; end; source='_4'; do until (_4); set _4 end=_4; output; end; source='_5'; do until (_5); set _5 end=_5; output; end; run;

proc sql; create table dupids as select id, source, count(id) as idcount from stacked group by id having idcount > 1 ; quit;

-- Richard A. DeVenezia http://www.devenezia.com/


Back to: Top of message | Previous page | Main SAS-L page