LISTSERV at the University of Georgia
Date:         Fri, 24 Sep 2004 10:00:36 -0500
Reply-To:     "Dunn, Toby" <Toby.Dunn@TEA.STATE.TX.US>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dunn, Toby" <Toby.Dunn@TEA.STATE.TX.US>
Subject:      Re: Merging two datasets by closeness of common variable
Comments: To: Michael Murff <mjm33@MSM1.BYU.EDU>
Content-Type: text/plain; charset="us-ascii"

Actually, Paul has a wonderful paper on do-loop processing called "The Magnificent Do", which delves into the wonderfully complex world of do-loops. He has also written many papers on hashing, which have caused more than one programmer to scratch their head in disbelief and ask for the aspirin bottle. In short, if I needed someone to build a complex system of matching and merging, Paul or Ian would be the people I would want.

Toby Dunn

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Michael Murff
Sent: Friday, September 24, 2004 9:50 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Merging two datasets by closeness of common variable

Thank you Paul et al. It will take me a few days to digest your solution and study its properties with real data. I did run it and noted that it did not "hang" as in the previous instance. The company that was repeated previously was the largest in the control dataset; I think what happened was that all the smaller companies had been used up and the largest one was the only one left that met the rule, which forced its repeated use for the last several iterations.

No matter, the present solution seems to work well and will assist me for some time to come. What's more, it has given me a glimpse of what array processing can do. You have my vote to publish the definitive book on array processing and algorithm development in SAS; something like "Numerical Recipes and Algorithm Development in SAS: An Array Processing Approach." Maybe SI has already Proc'ed most of the useful numerical routines, but I have yet to see a text with algorithmic lessons pertaining to SAS, e.g. recursion, complex looping, array processing (and other classic CS topics you would traditionally see addressed in C or other development languages).


Michael J. Murff
Research Associate
Finance Group (Business Man.)
Marriott School of Management
Brigham Young University
(801) 422-4933
murff at byu dot edu

>>> "Paul M. Dorfman" <sashole@BELLSOUTH.NET> 9/23/2004 1:32:34 PM >>>

Michael,

Well, Venky Chakravarthy reminded me offline that BS can stand not only for the binary search (not to mention practically everything else), but also for the Brute Force. He also called me cruel. Owning up to the fairness of the accusation, I will nonetheless risk another cruelty by submitting this approach, which precisely follows Venky's interpretation of the ubiquitous abbreviation:

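data small large ;
   _n_ = 100 ;                            * number of records still wanted in SMALL ;
   do id = 20100 to 1 by -1 ;
      size = ceil (ranuni (1) * 1e9) ;    * long enough to be nodup ;
      * selection sampling: take this id with probability (still wanted) / (ids left) ;
      if ranuni (1) < _n_ / id then do ;
         output small ;
         _n_ +- 1 ;
      end ;
      else output large ;
   end ;
run ;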

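data match (drop = _:) ;
   array sz [1 : 99999] _temporary_ ;   * big enough ;
   * on the first iteration, load every SIZE from LARGE into the array ;
   if _n_ = 1 then do p = 1 to n ;
      set large nobs = n point = p ;
      sz [p] = size ;
   end ;
   set small ;
   _min_sd = constant ('big') ;
   * brute force: scan all still-unused LARGE sizes for the closest one ;
   do p = 1 to n ;
      if sz [p] = . then continue ;     * cell already used by an earlier match ;
      _sd = (size - sz [p]) ** 2 ;
      if _sd >= _min_sd then continue ;
      _min_sd = _sd ;
      _min_pt = p ;
   end ;
   p = _min_pt ;
   set large (rename = (id = l_id size = l_size)) point = p ;
   sz [p] = . ;                         * void the chosen cell so it cannot be reused ;
run ;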

On the LARGE side, ties are still killed by voiding the cell that has already been chosen for a previous observation of SMALL, but all the remaining LARGE sizes are considered, and the one with the minimal squared difference is picked. I looked at the distribution of the resulting differences, and it did not seem to favor earlier observations from SMALL to any appreciable degree.
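
A minimal sketch of one way to repeat that check, assuming the MATCH data set produced by the step above is still in WORK and carries SIZE and L_SIZE. The names DIFFS, OBS_NO, DIFF and ABS_DIFF are introduced here for illustration only and are not from the original post.

```sas
* assumes WORK.MATCH from the step above ;
data diffs ;
   set match ;
   obs_no   = _n_ ;                 * position of the record in SMALL ;
   diff     = size - l_size ;       * signed difference of the matched pair ;
   abs_diff = abs (diff) ;
run ;

* overall distribution of the differences ;
proc means data = diffs n mean std min median max ;
   var diff abs_diff ;
run ;

* rank correlation between position in SMALL and size of the miss ;
proc corr data = diffs spearman ;
   var obs_no abs_diff ;
run ;
```

A Spearman correlation close to zero between OBS_NO and ABS_DIFF would be consistent with the observation that earlier records of SMALL are not favored; a clearly positive one would suggest that match quality decays as LARGE is used up.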

Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------

> -----Original Message-----
> From: Michael Murff []
> Sent: Wednesday, September 22, 2004 12:01 PM
> To:
> Cc: SAS-L@LISTSERV.UGA.EDU
> Subject: RE: Merging two datasets by closeness of common variable
>
> Hi SAS-L,
>
> Here is a summary of an offline conversation with Paul:
>
> Again, many thanks for your interest in this problem. Paul
> said, relating to his previously posted solution:
>
> <snip>
> >when an item has been picked, the next SIZE from SMALL has a chance to
> >be closer to preceding item from LARGE than to the next.
>
> , which is the most important problem that prior solutions
> have had. I would count its solution a major victory.
>
> <snip>
> >You are saying minimize the
> >sum of squared difference residuals... Say you have a SIZE from SMALL, how
> >in your mind the heuristics of the process would go to obtain the entry from
> >LARGE you deem appropriate?
>
> The idea of minimizing the sum of squared residuals (of the matches to
> small) may be too esoteric and too much work to implement from
> scratch, but I will try to describe what I was thinking:
>
> The final match set would be chosen to minimize the overall
> sum of squares of the size differences for the match set.
> This would ensure that ties are broken optimally, so that
> assignment of control company A to sample company 1 would not
> necessitate a very poor assignment of control company B to
> sample company 2.
>
> IF the sum of squares of the second potential pairing
> (COB-->CO2) is less than (COA-->CO1), then make the second
> assignment not the first. Of course such an algorithm would
> make sure that every pairing was superior to every other
> potential pairing to the end that the overall fit produces
> the minimum sum of differences (residuals). If such a pairing
> were actually found, then one could regress sample size on
> control size and get the best possible fit given the universe
> of potential match sets: the minimum sum of squares relative
> to any other set of pairings that could have occurred.
>
> Please don't take this discussion as a desired spec per se;
> rather I am just trying to flesh out my previous idea.
> If Paul has already written something that resolves the match
> quality decay problem (per private email) that would be of interest.
> (My hope is to retain some goodwill for a future day, not to
> mention that I'm sure you all have other high value
> activities beyond coding to spec for a distant -l
> acquaintance such as myself).
>
> Best,
>
> Michael Murff
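
The "minimum sum of squares over all possible match sets" described in the quoted message can at least be computed as a benchmark. The following is only a sketch under stated assumptions, not something posted in the thread: it swaps in a dynamic-programming pass, relying on the fact that for squared differences on a single variable some optimal one-to-one assignment preserves sort order. It assumes the SMALL and LARGE data sets from the earlier post, at most 100 records in SMALL, and at least as many records in LARGE as in SMALL. S_SORTED, L_SORTED, S, BEST, NS and MIN_SS are names made up for illustration.

```sas
proc sort data = small out = s_sorted ; by size ; run ;
proc sort data = large out = l_sorted ; by size ; run ;

data _null_ ;
   array s    [1 : 100] _temporary_ ;   * sorted SMALL sizes, assumes ns <= 100 ;
   array best [0 : 100] _temporary_ ;   * best [i] = least cost of matching the first i SMALL sizes ;
   if _n_ = 1 then do ;
      do i = 1 to ns ;
         set s_sorted (keep = size) nobs = ns point = i ;
         s [i]    = size ;
         best [i] = constant ('big') ;  * not reachable yet ;
      end ;
      best [0] = 0 ;
   end ;
   * read the sorted LARGE sizes one at a time ;
   set l_sorted (keep = size) end = last ;
   * go downward so that best [i - 1] still excludes the current LARGE record ;
   do i = ns to 1 by -1 ;
      best [i] = min (best [i], best [i - 1] + (s [i] - size) ** 2) ;
   end ;
   if last then do ;
      min_ss = best [ns] ;
      put 'Minimum total squared difference: ' min_ss ;
   end ;
run ;
```

Summing (SIZE - L_SIZE) ** 2 over a match set produced by any greedy approach and comparing it with MIN_SS shows how far that match set is from the overall optimum. The sketch reports only the minimum total; recovering the pairs themselves would need an additional back-tracking array.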
