Date: Tue, 23 Nov 2004 10:13:24 -0700
Reply-To: Michael Murff <mjm33@MSM1.BYU.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Michael Murff <mjm33@MSM1.BYU.EDU>
Subject: Greedy match with by processing (relist with problem mockup)
Content-Type: text/plain; charset=US-ASCII
Hi SAS-L,
This greedy matching algorithm is almost right, but it still outputs
records for which no match is possible. Imagine a sample dataset with
two observations that belong to a rather obscure level of the by group,
while the control dataset (big) has only one matching record. Instead
of stopping, the program jumps to the next available match in the next
by-group level of the big dataset (see the review dataset, where d2 is
not equal to zero). In that case I need to output a missing value or
not output at all. Can one of the gurus take a minute to run this code?
The basic idea, thanks to an offline email from Dennis D., is to create
a dataset that contains the start and stop position of each by group
(on the sorted big dataset) and then use these record numbers to
simulate by-group boundaries, so that matching is only possible within
a particular by-group pair.
Michael Murff
***************************************************;
/* analysis variables */
%let size = size;
%let byvar1 = byvar1;
%let byvar2 = byvar2;
/* mockup dataset */
data testdata;
length cusip $8.;
do i=1 to 2000;
do j=1 to 100; /* simulate by group levels */
do k=1 to 5;
cusip = compress("A"||ranuni(1),'. '); /* strip blanks and the decimal point */
size = round(ranuni(5)*1000,.001);
output;
end;
end;
end;
drop i;
rename j = &byvar1 k = &byvar2;
run;
/* eliminate duplicates */
proc sort data=testdata nodupkey; by cusip;
run;
/* create sample (small) and control (big) */
data big small;
set testdata;
if _N_ lt 101 then output small;
else output big;
run;
proc sort data=small out=ssmall;
by &byvar1 &byvar2 descending &size;
run;
proc sort data=big out=sbig;
by &byvar1 &byvar2 descending &size;
run;
/* create a dataset with the start and end record numbers in BIG for
each set of BY variables */
data ibig(keep=&byvar1 &byvar2 _start _stop);
set sbig;
by &byvar1 &byvar2;
retain _start;
if first.&byvar1 then _start = _N_;
if last.&byvar2;
_stop=_N_;
run;
/* find closest match between sample and control groups */
/* this step seems to output records where no match is possible */
/* need to constrain it to only output records where a match is
possible */
/* current behavior is to go to the next by group level and make the
closest match */
data closest(drop = j min _start _stop);
merge ssmall ibig;
by &byvar1 &byvar2;
retain j 0;
if first.&byvar1 then j = _start;
i=j;
min=.;
top: i=i+1;
if i gt _stop then return;
/*read the ith obs from 'big' - rename cusip and &size to keep
separate*/
set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1
&byvar2=c&byvar2)) point=i end=eod;
if eod then stop;
if min=. then do;
min=abs(&size-con&size);
j=i;
goto top;
end;
else if abs(&size-con&size)= 0 then do;
output;
j=i;
return;
end;
else if abs(&size-con&size)<min then do;
min=abs(&size-con&size);
j=i;
goto top;
end;
else if abs(&size-con&size)=min then goto top;
else if abs(&size-con&size)>min then do;
i=j;
set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1
&byvar2=c&byvar2))
point=i end=eod;
output;
return;
end;
if eod then stop;
run;
/* the smaller the testdata, the worse the problem becomes */
/* I think that when there is no possible match */
/* the algorithm goes to the next level (down) in the by group */
data review;
set closest;
d1 = &byvar1 - c&byvar1;
d2 = &byvar2 - c&byvar2;
if d1 ne 0 or d2 ne 0 then output;
run;
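/* Possible direction (untested sketch, not a confirmed fix): the       */
/* cross-group matches may come from keying the boundaries on           */
/* first.&byvar1, while the groups are really defined by the            */
/* &byvar1/&byvar2 pair. The variant below records _start at            */
/* first.&byvar2 instead. ibig2 is a hypothetical dataset name used     */
/* only for this illustration.                                          */
data ibig2(keep=&byvar1 &byvar2 _start _stop);
set sbig;
by &byvar1 &byvar2;
retain _start;
/* first.&byvar2 is set whenever either BY variable changes */
if first.&byvar2 then _start = _N_;
if last.&byvar2;
_stop = _N_;
run;
/* The matching step would presumably need the same change, resetting   */
/* its pointer once per pair, for example:                              */
/*    if first.&byvar2 then j = _start - 1;                             */
/* The "- 1" is assumed because the loop increments i before reading,   */
/* so the search would begin at the first record of the pair.           */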