Date: Tue, 23 Nov 2004 11:27:31 -0700
Reply-To: Michael Murff <mjm33@MSM1.BYU.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Michael Murff <mjm33@MSM1.BYU.EDU>
Subject: Re: Greedy match with by processing (relist with problem mockup)
Many thanks Dennis, but as far as I can tell, the subsetting IF
statement actually produces no matches at all with the sample data and
mockup routine I posted. Are the parentheses only stylistic, or do they
apply the NOT to both _big and _small? (My reading is the former.) My
sense is that the IN= merge flags are too early in the step to address
this issue; I conjecture this because I have tried a multitude of
in-flag combinations. I wonder if Randy R. would care to comment on
this one, as he probably understands the guts of the algorithm better
than I do?
Also, could anyone speak to the behind-the-scenes effects of using a
merge with multiple data sets; what is really going on record by record?
Michael
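For reference, here is a minimal sketch of how the IN= flags and the parenthesized NOT behave in a two-data-set merge. The data set and variable names (small_demo, big_demo, key, flags_demo, kept_by_not, kept_by_both) are made up for illustration and are not part of the posted routine. In SAS, NOT binds more tightly than AND, so the parentheses do matter: "not (_big and _small)" is true whenever at least one data set did not contribute to the current by-group, which would be consistent with seeing no matched pairs come through.

```sas
/* Toy data, assumed for illustration only:                 */
/* key 1 is in both data sets, key 2 only in small_demo,    */
/* key 3 only in big_demo. Both are already sorted by key.  */
data small_demo;
  input key size;
  datalines;
1 10
2 20
;
run;

data big_demo;
  input key _start _stop;
  datalines;
1 1 5
3 6 9
;
run;

data flags_demo;
  merge small_demo(in=_small) big_demo(in=_big);
  by key;
  /* IN= variables are temporary, so copy them to keep them */
  in_small = _small;
  in_big   = _big;
  /* NOT applies to the whole parenthesized expression */
  kept_by_not  = not (_small and _big);  /* 1 for keys 2 and 3 */
  kept_by_both = (_small and _big);      /* 1 for key 1 only   */
  /* write one line per merged observation to the log */
  put key= in_small= in_big= kept_by_not= kept_by_both=;
run;
```

The PUT statement writes one line per merged observation to the log, which is one way to watch what the merge does record by record. If the goal is to keep only by-groups present in both data sets, a subsetting "if _small and _big;" would be the test to try, though whether that alone resolves the cross-group matching in the posted routine is not something this sketch establishes.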
>>> Dennis Diskin <diskin@SNET.NET> 11/23/2004 10:37:21 AM >>>
Mike,
My message must have gotten lost somewhere.
What you need is something like:
data closest(drop = j min _start _stop);
merge ssmall (in=_small) ibig(in=_big);
by &byvar1 &byvar2;
if not (_big and _small);
HTH,
Dennis Diskin
Michael Murff <mjm33@MSM1.BYU.EDU> wrote:
Hi SAS-L,
This greedy matching algorithm is almost right, but still outputs the
records where a match is not possible. Imagine a sample dataset with two
observations that are members of a rather obscure level of the by group.
And the control dataset (big) only has one matching record. The program
is jumping to the next available match in the next by group level (see
review dataset where d2 is not equal to zero) of the big dataset. I need
to output a missing value or not output at all. Can one of the gurus
take a minute to run this code? The basic idea is to create a dataset
that contains the start and stop position for each by group (on the
sorted big dataset) and then use these record numbers to simulate
by-group boundaries so that matching is only possible within a
particular by-group pair; thanks to an offline email from Dennis D.
run;
Michael Murff
***************************************************;
/* analysis variables */
%let size = size;
%let byvar1 = byvar1;
%let byvar2 = byvar2;
/* mockup dataset */
data testdata;
length cusip $8.;
do i=1 to 2000;
do j=1 to 100; /* simulate by group levels */
do k=1 to 5;
cusip = compress("A"||ranuni(1),.);
size = round(ranuni(5)*1000,.001);
output;
end;
end;
end;
drop i;
rename j = &byvar1 k = &byvar2;
run;
/* eliminate duplicates */
proc sort data=testdata nodupkey; by cusip;
run;
/* create sample (small) and control (big) */
data big small;
set testdata;
if _N_ lt 101 then output small;
else output big;
run;
proc sort data=small out=ssmall;
by &byvar1 &byvar2 descending &size;
run;
proc sort data=big out=sbig;
by &byvar1 &byvar2 descending &size;
run;
/* create a dataset with the start and end record numbers in BIG for
each set of BY variables */
data ibig(keep=&byvar1 &byvar2 _start _stop);
set sbig;
by &byvar1 &byvar2;
retain _start;
if first.&byvar1 then _start = _N_;
if last.&byvar2;
_stop=_N_;
run;
/* find closest match bet. sample and control groups */
/* this step seems to output records where no match is possible */
/* need to constrain it to only output records where a match is
possible */
/* current behavior is to go to the next by group level and make the
closest match */
data closest(drop = j min _start _stop);
merge ssmall ibig;
by &byvar1 &byvar2;
retain j 0;
if first.&byvar1 then j = _start;
i=j;
min=.;
top: i=i+1;
if i gt _stop then return;
/*read the ith obs from 'big' - rename cusip and &size to keep
separate*/
set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1
&byvar2=c&byvar2)) point=i end=eod;
if eod then stop;
if min=. then do;
min=abs(&size-con&size);
j=i;
goto top;
end;
else if abs(&size-con&size)= 0 then do;
output;
j=i;
return;
end;
else if abs(&size-con&size)<min then do;
min=abs(&size-con&size);
j=i;
goto top;
end;
else if abs(&size-con&size)=min then goto top;
else if abs(&size-con&size)>min then do;
i=j;
set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1
&byvar2=c&byvar2))
point=i end=eod;
output;
return;
end;
if eod then stop;
run;
/* the smaller the testdata, the worse the problem becomes */
/* I think when there is no possible match */
/* the algorithm goes to the next level (down) in the by group */
data review;
set closest;
d1 = &byvar1 - c&byvar1;
d2 = &byvar2 - c&byvar2;
if d1 ne 0 or d2 ne 0 then output;
run;