|
Keith Kaiser wrote:
> Here is a wide open question for the list.
>
> Ten SAS data sets, 2 million rows each, maybe 10 columns.
>
> I need to extract any multiple records out of them. A record has a
> column called IDNUM, it is a unique identifier for the data, and all
> 10 data sets are indexed and sorted by it. The only other key
> variable is a date field called PDATE, think patient and treatment
> date.
>
> How do I extract just the duplicates rows by IDNUM?
If this is so, then IDNUM is not a unique identifier.
Perhaps it is a Primary key of some principal table, and a foreign key to
several other related tables/
> I need duplicates
> by year, and across year, so if it's in the 1995 data and again in
> 2002 I need it, but I also need it if it is duplicated in 1995 only.
> Do I set the 10 data sets together first? Do I use some complex SQL,
> is their a method I don't know that makes it easy? After I get the
> new data se= t
> with just the dupes I can do the rest I'm just looking for a simple,
> quick method of finding all dupes and keeping them.
Sounds like 'dupedness' can occur across tables.
i.e.
IDNUM=1 in table1 and
IDNUM=1 in table2 infers a duplicate to be acted upon.
Are the tables segmented in any designed way?
i.e.
Each table is for PDATEs of a given year.
Regardless, this sample stacks the data using a view and queries out the
duplicated ids while maintaining each instance contributing to the
duplication state. Stacking large numbers of tables tends towards wallpaper
code, which in turn tends towards macro. Avoid macro unless the number of
tables (or names thereof) being stacked changes with each required 'run
through' the data.
data _1 _2 _3 _4 _5;
do id = 1 to 1000;
if ranuni(1) < 0.10 then output _1;
if ranuni(1) < 0.10 then output _2;
if ranuni(1) < 0.10 then output _3;
if ranuni(1) < 0.10 then output _4;
if ranuni(1) < 0.10 then output _5;
end;
run;
data stacked / view=stacked;
source='_1'; do until (_1); set _1 end=_1; output; end;
source='_2'; do until (_2); set _2 end=_2; output; end;
source='_3'; do until (_3); set _3 end=_3; output; end;
source='_4'; do until (_4); set _4 end=_4; output; end;
source='_5'; do until (_5); set _5 end=_5; output; end;
run;
proc sql;
create table dupids as
select id, source, count(id) as idcount
from stacked
group by id
having idcount > 1
;
quit;
--
Richard A. DeVenezia
http://www.devenezia.com/
|