Actually, Paul has a wonderful paper on do-loop processing called "The
Magnificent Do", which delves into the wonderfully complex world of
do-loops. Also, he has written many papers on hashing, which has caused
more than one programmer to scratch their head in disbelief and ask for
the aspirin bottle. In short, if I needed someone to build a complex
system of matching and merging, Paul or Ian would be the people I would
want.
Toby Dunn
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Michael Murff
Sent: Friday, September 24, 2004 9:50 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Merging two datasets by closeness of common variable
Thank you Paul et al. It will take me a few days to digest your solution
and study its properties with real data. I did run it and noted that it
did not "hang" as in the previous instance. The company that was
repeated previously was the largest in the control dataset; I think what
happened was that all the smaller companies were used and the largest
one was the only one that met the rule thus necessitating its repeated
use for the last several iterations.
No matter, the present solution seems to work well and will assist me
for some time to come. What's more, it has given me a glimpse of what
array processing can do. You have my vote to publish the definitive book
on array processing and algorithm development in SAS; something like
"Numerical Recipes and Algorithm Development in SAS: An Array Processing
Approach." Maybe SI has already Proc'ed most of the useful numerical
routines, but I have yet to see a text with algorithmic lessons
pertaining to SAS, e.g. recursion, complex looping, array processing (and
other classic CS topics you would traditionally see addressed in C or
other development languages).
Best,
Michael J. Murff
Research Associate
Finance Group (Business Man.)
Marriott School of Management
Brigham Young University
(801) 422-4933
murff at byu dot edu
>>> "Paul M. Dorfman" <sashole@BELLSOUTH.NET> 9/23/2004 1:32:34 PM >>>
Michael,
Well, Venky Chakravarthy reminded me offline that BS can stand not only
for the binary search (not to mention practically everything else), but
also for the Brute Force. He also called me cruel. Owning up to the
fairness of the accusation, I will nonetheless risk another cruelty by
submitting this approach, which precisely follows Venky's interpretation
of the ubiquitous abbreviation:
data small large ;
_n_ = 100 ;
do id = 20100 to 1 by -1 ;
size = ceil (ranuni (1) * 1e9) ; *long enough to be nodup ;
if ranuni (1) < _n_ / id then do ;
output small ;
_n_ +- 1 ;
end ;
else output large ;
end ;
run ;
data match (drop = _:) ;
array sz [1 : 99999] _temporary_ ; *big enough;
if _n_ = 1 then do p = 1 to n ;
set large nobs = n point = p ;
sz [p] = size ;
end ;
set small ;
_min_sd = constant ('big') ;
do p = 1 to n ;
if sz [p] = . then continue ;
_sd = (size - sz [p]) ** 2 ;
if _sd => _min_sd then continue ;
_min_sd = _sd ;
_min_pt = p ;
end ;
p = _min_pt ;
set large (rename = (id = l_id size = l_size)) point = p ;
sz [p] = . ;
run ;
On the LARGE side, ties are still killed by voiding the cell that has
already been chosen for a previous observation of SMALL, but all the
rest of the LARGE sizes are considered, and the one with the minimal
squared difference is picked. I looked at the distribution of the
resulting differences, and it did not seem to favor earlier observations
from SMALL to any appreciable degree.
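To see how the greedy step behaves on a case small enough to check by
hand, here is a minimal sketch on toy data. Everything in it that is not
in the step above is an assumption made for illustration: the four
sizes, the hypothetical macro GREEDY with its IN_DS and OUT_DS
parameters, the data sets MATCH_ASIS, SMALL_DESC and MATCH_DESC, and the
variable SQ_DIFF. With SMALL in its given order, 100 takes 110 and 110
is left with 10, so the squared differences add up to 100 + 10000 =
10100. With SMALL sorted by descending SIZE, 110 takes 110 and 100 is
left with 10, for 0 + 8100 = 8100. So on contrived data like this the
processing order can matter, even if it does not show up on the random
test data.

```sas
/* Toy data: two sample (SMALL) and two control (LARGE) companies */
data small ;
   input id size ;
   datalines ;
1 100
2 110
;
run ;

data large ;
   input id size ;
   datalines ;
1 110
2 10
;
run ;

/* Hypothetical helper: the greedy MATCH step, wrapped for reuse */
%macro greedy (in_ds =, out_ds =) ;
   data &out_ds (drop = _:) ;
      array sz [1 : 10] _temporary_ ;   /* must hold all of LARGE */
      /* load every LARGE size into the array once */
      if _n_ = 1 then do p = 1 to n ;
         set large nobs = n point = p ;
         sz [p] = size ;
      end ;
      set &in_ds ;
      /* scan the unused LARGE sizes for the closest one */
      _min_sd = constant ('big') ;
      do p = 1 to n ;
         if sz [p] = . then continue ;
         _sd = (size - sz [p]) ** 2 ;
         if _sd >= _min_sd then continue ;
         _min_sd = _sd ;
         _min_pt = p ;
      end ;
      p = _min_pt ;
      set large (rename = (id = l_id size = l_size)) point = p ;
      sz [p] = . ;                      /* void the chosen cell */
      sq_diff = _min_sd ;               /* keep the squared difference */
   run ;

   /* list the pairs and total the squared differences */
   proc print data = &out_ds ;
      sum sq_diff ;
   run ;
%mend greedy ;

/* SMALL in its original order */
%greedy (in_ds = small, out_ds = match_asis)

/* SMALL with the largest company first */
proc sort data = small out = small_desc ;
   by descending size ;
run ;

%greedy (in_ds = small_desc, out_ds = match_desc)
```

Comparing the two PROC PRINT totals is a cheap way to gauge how
sensitive a given pair of data sets is to the order of SMALL. It is not
a search for the overall minimum.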
Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
> -----Original Message-----
> From: Michael Murff [mailto:mjm33@msm1.byu.edu]
> Sent: Wednesday, September 22, 2004 12:01 PM
> To: sashole@bellsouth.net
> Cc: SAS-L@LISTSERV.UGA.EDU
> Subject: RE: Merging two datasets by closeness of common variable
>
> Hi SAS-L,
>
> Here is a summary of an offline conversation with Paul:
>
> Again, many thanks for your interest in this problem. Paul
> said, relating to his previously posted solution:
>
> <snip>
> >when an item has been picked, the next SIZE from SMALL has a chance to
> >be closer to preceding item from LARGE than to the next.
>
> This is the most important problem that prior solutions
> have had. I would count its solution a major victory.
>
> <snip>
> >You are saying minimize the sum of squared difference residuals...
> >Say you have a SIZE from SMALL, how in your mind the heuristics of
> >the process would go to obtain the entry from LARGE you deem
> >appropriate?
>
> The idea of minimizing the sum of squared residuals (of the matches
> to small) may be too esoteric and too much work to implement from
> scratch, but I will try to describe what I was thinking:
>
> The final match set would be chosen to minimize the overall
> sum of squares of the size differences for the match set.
> This would ensure that ties are broken optimally, so that
> assignment of control company A to sample company 1 would not
> necessitate a very poor assignment of control company B to
> sample company 2.
>
> IF the sum of squares of the second potential pairing
> (COB-->CO2) is less than (COA-->CO1), then make the second
> assignment not the first. Of course such an algorithm would
> make sure that every pairing was superior to every other
> potential pairing to the end that the overall fit produces
> the minimum sum of differences (residuals). If such a pairing
> were actually found, then one could regress sample size on
> control size and get the best possible fit given the universe
> of potential match
> sets: the minimum sum of squares relative to any other set of
> pairings that could have occurred.
>
> Please don't take this discussion as a desired spec per se;
> rather I am just trying to flesh out my previous idea.
> If Paul has already written something that resolves the match
> quality decay problem (per private email) that would be of interest.
> (My hope is to retain some goodwill for a future day not to
> mention that I'm sure you all have other high value
> activities beyond coding to spec for a distant -l
> acquaintance such as myself).
>
> Best,
>
> Michael Murff
>