LISTSERV at the University of Georgia
Date:         Tue, 23 Nov 2004 11:27:31 -0700
Reply-To:     Michael Murff <mjm33@MSM1.BYU.EDU>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Michael Murff <mjm33@MSM1.BYU.EDU>
Subject:      Re: Greedy match with by processing (relist with problem mockup)
Comments: To: diskin@SNET.NET
Content-Type: text/plain; charset=US-ASCII

Many thanks Dennis, but as far as I can tell, the subsetting IF statement actually produces no matches at all with the sample data and mockup routine I posted. Are the parentheses only stylistic, or do they apply the NOT to both _big and _small? (My reading is the former.) My sense is that the IN= merge flags are set too early in the step to address this issue; I conjecture this because I have tried a multitude of in-flag combinations. I wonder if Randy R. would care to comment on this one, as he probably understands the guts of the algorithm better than I do? Also, could anyone speak to the behind-the-scenes effects of using a MERGE together with multiple SET statements; what is really going on record by record?

Michael

>>> Dennis Diskin <diskin@SNET.NET> 11/23/2004 10:37:21 AM >>>

Mike,

My message must have gotten lost somewhere. What you need is something like:

data closest(drop = j min _start _stop);
  merge ssmall (in=_small) ibig(in=_big);
  by &byvar1 &byvar2;
  if not (_big and _small);
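As a minimal sketch of how the IN= flags and the parenthesised NOT behave in the statement above, assuming two hypothetical toy datasets (small_demo and big_demo, with a single BY variable named key; none of these names come from the posted program):

```sas
/* Hypothetical toy data: key 1 is only in small_demo, key 3 only in big_demo, key 2 in both */
data small_demo;
  input key;
  datalines;
1
2
;
run;

data big_demo;
  input key;
  datalines;
2
3
;
run;

data flags;
  merge small_demo(in=_small) big_demo(in=_big);
  by key;
  /* IN= variables are temporary, so copy them to keep them in the output */
  in_small = _small;
  in_big   = _big;
  /* NOT applied to the whole conjunction: false only when the key is in both datasets */
  not_both = not (_big and _small);
  /* Without parentheses NOT binds tighter than AND: (not _big) and _small */
  no_parens = (not _big and _small);
run;

proc print data=flags noobs;
run;

/* key  in_small  in_big  not_both  no_parens */
/*  1      1        0        1         1      */
/*  2      1        1        0         0      */
/*  3      0        1        1         0      */
```

In this toy case the parentheses are not just stylistic: the two expressions differ for key 3. Because a subsetting IF keeps only the observations for which its expression is true, `if not (_big and _small);` would keep keys 1 and 3 here, that is, the BY groups that are missing from one of the two datasets.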

HTH, Dennis Diskin

Michael Murff <mjm33@MSM1.BYU.EDU> wrote:

Hi SAS-L,

This greedy matching algorithm is almost right, but it still outputs records where a match is not possible. Imagine a sample dataset with two observations that are members of a rather obscure level of the by group, while the control dataset (big) has only one matching record. The program jumps to the next available match in the next by-group level of the big dataset (see the review dataset, where d2 is not equal to zero). I need to output a missing value or not output at all. Can one of the gurus take a minute to run this code? The basic idea, thanks to an offline email from Dennis D., is to create a dataset that contains the start and stop positions for each by group (on the sorted big dataset) and then use these record numbers to simulate by-group boundaries, so that matching is only possible within a particular by-group pair.

run;

Michael Murff

***************************************************;
/* analysis variables */
%let size = size;
%let byvar1 = byvar1;
%let byvar2 = byvar2;

/* mockup dataset */
data testdata;
  length cusip $8.;
  do i=1 to 2000;
    do j=1 to 100; /* simulate by group levels */
      do k=1 to 5;
        cusip = compress("A"||ranuni(1),.);
        size = round(ranuni(5)*1000,.001);
        output;
      end;
    end;
  end;
  drop i;
  rename j = &byvar1 k = &byvar2;
run;

/* eliminate duplicates */
proc sort data=testdata nodupkey;
  by cusip;
run;

/* create sample (small) and control (big) */
data big small;
  set testdata;
  if _N_ lt 101 then output small;
  else output big;
run;

proc sort data=small out=ssmall;
  by &byvar1 &byvar2 descending &size;
run;

proc sort data=big out=sbig;
  by &byvar1 &byvar2 descending &size;
run;

/* create a dataset with the start and end record numbers in BIG for each set of BY variables */

data ibig(keep=&byvar1 &byvar2 _start _stop);
  set sbig;
  by &byvar1 &byvar2;
  retain _start;
  if first.&byvar1 then _start = _N_;
  if last.&byvar2;
  _stop=_N_;
run;

/* find closest match bet. sample and control groups */
/* this step seems to output records where no match is possible */
/* need to constrain it to only output records where a match is possible */
/* current behavior is to go to the next by group level and make closest match */

data closest(drop = j min _start _stop);
  merge ssmall ibig;
  by &byvar1 &byvar2;

  retain j 0;
  if first.&byvar1 then j = _start;

  i=j;
  min=.;
  top:
  i=i+1;
  if i gt _stop then return;

  /*read the ith obs from 'big' - rename cusip and &size to keep separate*/
  set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1 &byvar2=c&byvar2)) point=i end=eod;
  if eod then stop;
  if min=. then do;
    min=abs(&size-con&size);
    j=i;
    goto top;
  end;
  else if abs(&size-con&size)= 0 then do;
    output;
    j=i;
    return;
  end;
  else if abs(&size-con&size)<min then do;
    min=abs(&size-con&size);
    j=i;
    goto top;
  end;
  else if abs(&size-con&size)=min then goto top;
  else if abs(&size-con&size)>min then do;
    i=j;
    set sbig(rename=(cusip=concus &size=con&size &byvar1=c&byvar1 &byvar2=c&byvar2)) point=i end=eod;
    output;
    return;
  end;
  if eod then stop;
run;

/* smaller the testdata the worse the problem becomes */
/* I think when there is no possible match */
/* the algorithm goes to the next level (down) in the by group */
data review;
  set closest;
  d1 = &byvar1 - c&byvar1;
  d2 = &byvar2 - c&byvar2;
  if d1 ne 0 or d2 ne 0 then output;
run;

