LISTSERV at the University of Georgia
Date:         Tue, 23 Nov 2004 10:13:24 -0700
Reply-To:     Michael Murff <mjm33@MSM1.BYU.EDU>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Michael Murff <mjm33@MSM1.BYU.EDU>
Subject:      Greedy match with by processing (relist with problem mockup)
Content-Type: text/plain; charset=US-ASCII

Hi SAS-L,

This greedy matching algorithm is almost right, but it still outputs records for which no match is possible. Imagine that the sample dataset (small) has two observations in a rather obscure level of the by group, while the control dataset (big) has only one matching record. Instead of stopping, the program jumps to the next available match in the next by-group level of the big dataset (see the review dataset, where d2 is not equal to zero). I need it either to output a missing value or not to output the record at all. Can one of the gurus take a minute to run this code?

The basic idea, thanks to an offline email from Dennis D., is to create a dataset that contains the start and stop positions of each by group in the sorted big dataset, and then to use these record numbers to simulate by-group boundaries so that matching is only possible within a particular by-group pair.

run;

Michael Murff

***************************************************;
/* analysis variables */
%let size = size;
%let byvar1 = byvar1;
%let byvar2 = byvar2;

/* mockup dataset */
data testdata;
  length cusip $8.;
  do i=1 to 2000;
    do j=1 to 100; /* simulate by group levels */
      do k=1 to 5;
        cusip = compress("A"||ranuni(1),.);
        size = round(ranuni(5)*1000,.001);
        output;
      end;
    end;
  end;
  drop i;
  rename j = &byvar1 k = &byvar2;
run;

/* eliminate duplicates */
proc sort data=testdata nodupkey;
  by cusip;
run;

/* create sample (small) and control (big) */
data big small;
  set testdata;
  if _N_ lt 101 then output small;
  else output big;
run;

proc sort data=small out=ssmall;
  by &byvar1 &byvar2 descending &size;
run;

proc sort data=big out=sbig;
  by &byvar1 &byvar2 descending &size;
run;

/* create a dataset with the start and end record numbers in BIG for each set of BY variables */

data ibig(keep=&byvar1 &byvar2 _start _stop);
  set sbig;
  by &byvar1 &byvar2;
  retain _start;
  if first.&byvar1 then _start = _N_;
  if last.&byvar2;
  _stop=_N_;
run;
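
To make the start/stop bookkeeping concrete, here is a minimal sketch on a toy dataset. The names toy, bounds, g1 and g2 are hypothetical and are not part of the program in this post. The sketch resets _start at first.g2, that is, at the first row of every (g1, g2) pair. That is an assumption about where the boundaries are meant to fall, not a confirmed fix for the problem described above.

```sas
/* toy data, already sorted by g1 g2 (hypothetical values) */
data toy;
  input g1 g2 size;
  datalines;
1 1 10
1 1 8
1 2 7
2 1 5
2 1 3
2 1 1
;
run;

/* one output row per (g1, g2) pair with its first and last record number */
data bounds(keep=g1 g2 _start _stop);
  set toy;
  by g1 g2;
  retain _start;
  if first.g2 then _start = _N_;  /* assumption: reset at the finest BY level */
  if last.g2;                     /* keep only the last row of each pair */
  _stop = _N_;
run;

/* bounds contains:
   g1=1 g2=1 _start=1 _stop=2
   g1=1 g2=2 _start=3 _stop=3
   g1=2 g2=1 _start=4 _stop=6 */
```

With boundaries of this shape, a lookup that starts at _start and never reads past _stop stays inside a single (g1, g2) pair.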

/* find closest match between sample and control groups */
/* this step seems to output records where no match is possible */
/* need to constrain it to only output records where a match is possible */
/* current behavior is to go to the next by group level and make the closest match */

data closest(drop = j min _start _stop);
  merge ssmall ibig;
  by &byvar1 &byvar2;

  retain j 0;
  if first.&byvar1 then j = _start;

  i=j;
  min=.;
  top:
  i=i+1;
  if i gt _stop then return;

  /* read the ith obs from 'big' - rename cusip and &size to keep separate */
  set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1 &byvar2=c&byvar2))
      point=i end=eod;
  if eod then stop;
  if min=. then do;
    min=abs(&size-con&size);
    j=i;
    goto top;
  end;
  else if abs(&size-con&size)= 0 then do;
    output;
    j=i;
    return;
  end;
  else if abs(&size-con&size)<min then do;
    min=abs(&size-con&size);
    j=i;
    goto top;
  end;
  else if abs(&size-con&size)=min then goto top;
  else if abs(&size-con&size)>min then do;
    i=j;
    set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1 &byvar2=c&byvar2))
        point=i end=eod;
    output;
    return;
  end;
  if eod then stop;
run;

/* smaller the testdata the worse the problem becomes */
/* I think when there is no possible match */
/* the algorithm goes to the next level (down) in the by group */
data review;
  set closest;
  d1 = &byvar1 - c&byvar1;
  d2 = &byvar2 - c&byvar2;
  if d1 ne 0 or d2 ne 0 then output;
run;

