On Sep 27, 12:53 am, Tetyana <teryoshi...@prodigy.net> wrote:
> Hi All,
>
> My boss asked me for the answer and the code for the next question.
> Can you please help? I copied his question completely. I don't have
> any idea how to do it and I used SAS on a mainframe (UNIX) only once a
> long time ago.
>
> If I had a big flat file that does not fit on disk, say 2.2 billion
> records, unsorted, with 50,000 different keys, and I wanted to create
> another file by merging the big file with a much smaller file of
> 10,000 keys, keeping only those records in the big file that DO
> NOT MATCH the keys in the smaller file, how would I do it on the
> mainframe? Remember that I cannot allocate that much disk space, even
> if it's temporary.
>
> Best Regards,
> Tetyana

Step one, DATA Step pass over SMALL
- INPUT your key values from SMALL
- use a DATA Step hash to capture the distinct keys
- output to dataset EXCLUSION_KEYS (only 10,000 rows)

Step two, DATA Step pass over TAPE
- populate hash X from EXCLUSION_KEYS
- INPUT your key value from TAPE, holding the record with a trailing @
- if X.find() ne 0; *TAPE key is not in EXCLUSION_KEYS;
- INPUT the remainder of the record
- OUTPUT to dataset FILTERED (or PUT to a tape file if the filtered
  record count is going to be too big for disk)

Then do stuff with FILTERED.
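
The two steps above could be sketched roughly as follows. This is only
an outline under assumptions: the filerefs SMALL and TAPE are
hypothetical and would have to be assigned beforehand (FILENAME
statement or JCL DD), and the record layout (a key followed by five
comma-separated values v1-v5) is borrowed from the test data, not from
the real file.
----------------
* Step one. Keep the distinct keys found in SMALL;
* SMALL is an assumed fileref;
data EXCLUSION_KEYS;
  if _n_ = 1 then do;
    declare hash seen ();
    seen.defineKey('key');
    seen.defineDone();
  end;
  infile SMALL;
  input key;
  * Subsetting IF, continue only for a key not seen before;
  if seen.check() ne 0;
  seen.add();
run;

* Step two. One pass over the big file, TAPE is an assumed fileref;
data FILTERED;
  if _n_ = 1 then do;
    * Load the hash straight from the dataset built in step one;
    declare hash X (dataset: 'EXCLUSION_KEYS');
    X.defineKey('key');
    X.defineDone();
  end;
  infile TAPE dlm=',';
  * Read the key and hold the record;
  input key @;
  * Subsetting IF, continue only when the key is not excluded;
  if X.check() ne 0;
  * Assumed layout for the rest of the record;
  input v1-v5;
run;
----------------
check() is used here in place of find() because only the existence of
the key matters; either method returns 0 when the key is in the hash.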

Here is a sample you can fiddle with. It writes two small test files
to the WORK directory and does both steps in a single DATA Step:
----------------
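%let path = %sysfunc(PATHNAME(work));

* Test data for BIG, 20000 records of a key plus five values;
data _null_;
  file "&path.\big.txt" dlm=',';
  array v(5) (1:5);
  do _n_ = 1 to 20000;
    key = 20000-_n_ + int(5*ranuni(1234));
    put key v1-v5;
  end;
run;

* Test data for SMALL, about one in five of the keys 1 to 20000;
data _null_;
  file "&path.\small.txt" dlm=',';
  do key = 1 to 20000;
    if ranuni(1234) < 0.20 then PUT key;
  end;
run;

data BIG_NOT_SMALL;
  * Hash X holds the exclusion keys read from SMALL;
  declare hash X ();
  X.defineKey('key');
  X.defineDone();

  infile "&path.\small.txt" end=end_of_small;
  do while (not end_of_small);
    input key;
    * replace rather than add, so duplicate keys are harmless;
    X.replace();
  end;

  _count = X.num_items;
  put 'number of keys in X:' _count;

  * Single pass over BIG, keeping records whose key is not in X;
  infile "&path.\big.txt" dlm=',' end=end_of_big;
  do while (not end_of_big);
    * Read the key only and hold the record with the trailing @;
    input key @;
    * A null INPUT releases an excluded record without reading it;
    if X.find() = 0 then
      input;
    else do;
      input v1-v5;
      OUTPUT;
    end;
  end;

  stop;
  drop _:;
run;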
----------------
Note that the output is not sorted; the records come out in the same
order in which they appear in the big file.
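
If the filtered records would themselves be too big for a dataset, the
PUT alternative from the outline might look something like this. Again
a sketch under assumptions: TAPE and OUT are hypothetical filerefs that
must already be assigned, and EXCLUSION_KEYS is the dataset of distinct
keys from step one of the outline.
----------------
data _null_;
  if _n_ = 1 then do;
    declare hash X (dataset: 'EXCLUSION_KEYS');
    X.defineKey('key');
    X.defineDone();
  end;
  * TAPE and OUT are assumed filerefs;
  infile TAPE dlm=',';
  file OUT;
  * Read the key and hold the record in the input buffer;
  input key @;
  * Copy the raw record unchanged when its key is not excluded;
  if X.check() ne 0 then put _infile_;
run;
----------------
Writing _INFILE_ copies each kept record as-is, so the remaining fields
never need to be parsed.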
--
Richard A. DeVenezia
http://www.devenezia.com