|
Dear Lara,
I downloaded the macro some time ago and the code below works perfectly
with the example from the original authors.
Hope that helps,
Oliver
** GMATCH Macro to match 1 or more controls for each of N cases
using the GREEDY algorithm--REPLACES GREEDY option of MATCH macro.
Changes:
--cases and controls in same dataset
--not mandatory to randomly pre-ort cases and controls, but
recommended
--options to transform X's and to choose distance metric
--input parameters consistent with %DIST macro for optimal matching
**;
/*
Macro name: %gmatch
Authors: Jon Kosanke and Erik Bergstralh
Date: July 23, 2003
October 31, 2003...tweaked print/means based on "time" var
Macro function:
Matching using the GREEDY algorithm
The purpose of this macro is to match 1 or more controls(from a total
of M) for each of N cases. The controls may be matched to the cases
by
one or more factors(X's). The control selected for a particular
case(i) will be the control(j) closest to the case in terms of Dij.
Dij can be defined in multiple ways. Common choices are the Euclidean
distance and the weighted sum of the absolute differences between the
case and control matching factors. I.e.,
Dij= SQRT [SUM { W.k*(X.ik-X.jk)**2} ], or
Dij= SUM { W.k*ABS(X.ik-X.jk) },
where the sum is over the number
of matching factors X(with index
k) and W.k = the weight assigned
to matching factor k and X.ik =
the value of variable X(k) for
subject i.
The control(j) selected for a case(i) is the one with the smallest Dij
(subject to constraints DMAX and DMAXK, defined below). In the case of
ties, the first one encountered will be used. The higher the
user-defined
weight, the more likely it is that the case and control will be
matched
on the factor. Assign large weights (relative to the other weights)
to
obtain exact matches for two-level factors such as gender. An option
to
using weights might be to standarize the X's in some fashion. The
macro
has options to standardize all X's to mean 0 and variance 1 and to use
ranks.
The matching algorithm used is the GREEDY method. Using the greedy
method,
once a match is made it is never broken. This may result in
inefficiencies
if a previously matched control would be a better match for the
current
case than those controls currently available. (An alternative method
is to
do optimal matching using the VMATCH & DIST macros. This method
guarantees
the best possible matched set in terms of minimizing the total Dij.)
The GREEDY method generally produces very good matches, especially if
the
control pool is large relative to the number of cases. When multiple
controls/case are desired, the algorithm first matches 1 control to
all
cases and then proceeds to select second controls.
The gmatch macro checks for missing values of matching variables and
the
time variable(if specified) and deletes those observations from the
input
dataset.
Call statement:
%gmatch(data=,group=,id=,
mvars=,wts=,dmaxk=,dmax=,transf,
time=, dist=,
ncontls=,seedca=,seedco=,
out=,outnmca=,outnmco=,print=);
Parameter definitions(R=required parameter):
R data SAS data set containing cases and potential controls. Must
contain the ID, GROUP, and the matching variables.
R group SAS variable defining cases. Group=1 if case, 0 if
control.
R id SAS CHARACTER ID variable for the cases and controls.
R mvars List of numeric matching variables common to both case and
control data sets. For example, mvars=male age birthyr.
R wts List of non-negative weights corresponding to each
matching
variable. For example wts=10 2 1 corresponding to male,
age
and birthyr as in the above example.
dmaxk List of non-negative values corresponding to each matching
variable. These numbers are the largest possible absolute
differences compatible with a valid match. Cases will
NOT be matched to a control if ANY of the INDIVIDUAL
matching factor differences are >DMAXK. This optional
parameter allows one to form matches of the type male+/-0,
age+/-2, birth year+/-5 by specifying DMAXK=0 2 5.
dmax Largest value of Dij considered to be a valid match. If
you want to match exactly on a two-level factor(such as
gender coded as 0 or 1) then assign DMAX to be less than
the weight for the factor. In the example above, one
could
use wt=10 for male and dmax=9. Leave DMAX blank if any
Dij is a valid match. One would typically NOT use both
DMAXK and DMAX. The only advantage to using both, would
be
to further restrict potential matches that meet the
DMAXK criteria.
dist Indicates type of distance to calculate.
1=weighted sum(over matching vars) of
absolute case-control differences(default)
2=weighted Euclidean distance
time Time variable used for risk set matching. Matches are
only
valid if the control time > case time. May need to
transf Indicates whether all matching vars are to be transformed
(using the combined case+control data) prior to computing
distances. 0=no(default),
1=standardize to mean 0 and variance 1,
2=use ranks of matching variables.
ncontls Indicates the number of controls to match to each case.
The
default is 1. With multiple controls per case, the
algorithm
will first match every case to one control and then again
match each case to a second control, etc. Controls
selected
on the first pass will be stronger matches than those
selected in
later rounds. The output data set contains a variable
(cont_n)
which indicates on which round the control was selected.
seedca Seed value used to randomly sort the cases prior to
matching. This positive integer will be used as input to
the RANUNI function. The greedy matching algorithm is
order dependent which, among other things means that
cases matched first will be on average more similar to
their controls than those matched last(as the number of
control choices will be limited). If the matching order
is related to confounding factors (possibly age or
calendar time) then biases may result. Therefore it is
generally considered good practice when using the GREEDY
method to randomly sort both the cases and controls
before beginning the matching process.
seedco Seed value used to randomly sort the controls prior to
matching using the GREEDY method. This seed value must
also be a positive integer.
print= Option to print data for matched cases. Use PRINT=y to
print data and PRINT=n or blank to not print. Default is y.
out=name of SAS data set containing the results of the matching
process. Unmatched cases are not included. See outnm
below. The default name is __out. This data set will have
the following layout:
Case_id Cont_id Cont_n Dij Delta_caco MVARS_ca MVARS_co
1 67 1 5.2 (Differences & actual
1 78 2 6.1 values for matching factors
2 52 1 2.9 for cases & controls)
2 92 2 3.1
. . . .
. . . .
outnmca=name of SAS data set containing NON-matched cases.
Default name is __nmca .
outnmco=name of SAS data set containing NON-matched controls.
Default name is __nmco .
References: Bergstralh, EJ and Kosanke JL(1995). Computerized
matching of controls. Section of Biostatistics
Technical Report 56. Mayo Foundation.
Example: 1-1 matching by male(exact), age(+-2) and year(+-5).
The wt for male is not relevant, as only exact matches
on male will be considered. The weight for age(2) is
double that for year(1).
%gmatch(data=all, group=ca_co,id=clinic,
mvars=male age_od yr_od,
wts=2 2 1, dmaxk=0 2 5,out=mtch,
seedca=87877,seedco=987973);
Copyright Mayo Clinic College of Medicine 2003, all rights reserved.
This software is free software; you can redistribute it and/or modify
it
under the terms of the GNU General Public License as published by the
Free
Software Foundation; either version 2 of the License, or (at your
option)
any later version.
This software is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for
more details.
*************************************************************** */
%MACRO GMATCH(DATA=,GROUP=,ID=,
MVARS=,WTS=,DMAXK=,DMAX=,DIST=1,
NCONTLS=1, TIME=,TRANSF=0,
SEEDCA=,SEEDCO=,PRINT=y,
OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO);
%LET BAD=0;
%IF %LENGTH(&DATA)=0 %THEN %DO;
%PUT ERROR: NO DATASET SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&ID)=0 %THEN %DO;
%PUT ERROR: NO ID VARIABLE SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&GROUP)=0 %THEN %DO;
%PUT ERROR: NO CASE(1)/CONTROL(0) GROUP VARIABLE SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&MVARS)=0 %THEN %DO;
%PUT ERROR: NO MATCHING VARIABLES SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&WTS)=0 %THEN %DO;
%PUT ERROR: NO WEIGHTS SUPPLIED;
%LET BAD=1;
%END;
%LET NVAR=0;
%DO %UNTIL(%SCAN(&MVARS,&NVAR+1,' ')= );
%LET NVAR=%EVAL(&NVAR+1);
%END;
%LET NWTS=0;
%DO %UNTIL(%QSCAN(&WTS,&NWTS+1,' ')= );
%LET NWTS=%EVAL(&NWTS+1);
%END;
%IF &NVAR^= &NWTS %THEN %DO;
%PUT ERROR: #VARS MUST EQUAL #WTS;
%LET BAD=1;
%END;
%LET NK=0;
%IF %QUOTE(&DMAXK)^= %THEN %DO %UNTIL(%QSCAN(&DMAXK,&NK+1,' ')= );
%LET NK=%EVAL(&NK+1);
%END;
%IF &NK>&NVAR %THEN %LET NK=&NVAR;
%DO I=1 %TO &NVAR;
%LET V&I=%SCAN(&MVARS,&I,' ');
%END;
%IF &NWTS>0 %THEN %DO;
DATA _NULL_;
%DO I=1 %TO &NWTS;
%LET W&I=%SCAN(&WTS,&I,' ');
IF &&W&I<0 THEN DO;
PUT 'ERROR: WEIGHTS MUST BE NON-NEGATIVE';
CALL SYMPUT('BAD','1');
END;
%END;
RUN;
%END;
%IF &NK>0 %THEN %DO;
DATA _NULL_;
%DO I=1 %TO &NK;
%LET K&I=%SCAN(&DMAXK,&I,' ');
IF &&K&I<0 THEN DO;
PUT 'ERROR: DMAXK VALUES MUST BE NON-NEGATIVE';
CALL SYMPUT('BAD','1');
END;
%END;
RUN;
%END;
%MACRO MAX1;
%IF &DMAX^= %THEN %DO;
& __D<=&DMAX
%END;
%DO I=1 %TO &NK;
& ABS(__CA&I-__CO&I)<=&&K&I
%END;
%MEND MAX1;
%macro greedy;
%GLOBAL BAD2;
data __CHECK; set &DATA;
__id=&id;
if __id="" then delete;
%DO I=1 %TO &NVAR;
IF %scan(&mvars,&i)=. THEN DELETE;
%END;
%IF &TIME^= %THEN %DO;
IF &TIME=. THEN DELETE;
%END;
run;
*** transform data if requested/separate cases & controls;
%if &transf=1 %then %do;
proc standard data=__check m=0 s=1 out=_stdzd; var &mvars;
data _caco;
set _stdzd;
%end;
%if &transf=2 %then %do;
proc rank data=__check out=_ranks; var &mvars;
data _caco;
set _ranks;
%end;
%if &transf=0 %then %do;
data _caco;
set __check;
%end;
DATA __CASE; SET _caco;
if &group=1;
DATA __CASE; SET __CASE END=EOF;
KEEP __IDCA __CA1-__CA&NVAR __R &mvars
%if &time^= %then %do;
__catime
%end;
;
__IDCA=&ID;
%if &time^= %then %do;
__catime=&time;
%end;
%DO I=1 %TO &NVAR;
__CA&I=&&V&I;
%END;
%if &seedca^= %then %do;
SEED=&SEEDCA;
__R=RANUNI( SEED );
%end;
%else %do;
__R=1;
%end;
IF EOF THEN CALL SYMPUT('NCA',_N_);
PROC SORT; BY __R __IDCA;
DATA __CONT; SET _caco;
if &group=0;
DATA __CONT; SET __CONT END=EOF;
KEEP __IDCO __CO1-__CO&NVAR __R &mvars
%if &time^= %then %do;
__cotime
%end;
;
__IDCO=&ID;
%if &time^= %then %do;
__cotime=&time;
%end;
%DO I=1 %TO &NVAR;
__CO&I=&&V&I;
%END;
%if &seedco^= %then %do;
SEED=&SEEDCo;
__R=RANUNI( SEED );
%end;
%else %do;
__R=1;
%end;
IF EOF THEN CALL SYMPUT('NCO',_N_);
RUN;
%LET BAD2=0;
%IF &NCO < %EVAL(&NCA*&NCONTLS) %THEN %DO;
%PUT ERROR: NOT ENOUGH CONTROLS TO MAKE REQUESTED MATCHES;
%LET BAD2=1;
%END;
%IF &BAD2=0 %THEN %DO;
PROC SORT; BY __R __IDCO;
DATA __MATCH;
KEEP __IDCA __CA1-__CA&NVAR __DIJ __MATCH __CONT_N
%if &time^= %then %do;
__catime __cotime
%end;
;
ARRAY __USED(&NCO) $ 1 _TEMPORARY_;
DO __I=1 TO &NCO;
__USED(__I)='0';
END;
DO __I=1 TO &NCONTLS;
DO __J=1 TO &NCA;
SET __CASE POINT=__J;
__SMALL=.;
__MATCH=.;
DO __K=1 TO &NCO;
IF __USED(__K)='0' THEN DO;
SET __CONT POINT=__K;
%if &dist=2 %then %do;
**wtd euclidian dist;
__D= sqrt(
%do k=1 %to &nvar;
%scan(&wts,&k)*(__ca&k - __co&k)**2
%if &k<&nvar %then + ;
%end;
);
%end;
%else %do;
**wtd sum absolute diff;
__D=
%do k=1 %to &nvar;
%scan(&wts,&k)*abs(__ca&k - __co&k )
%if &k<&nvar %then + ;
%end;
;
%end;
IF __d^=. & (__SMALL=. | __D<__SMALL) %MAX1
%if &time^= %then %do;
& __cotime > __catime
%end;
THEN DO;
__SMALL=__D;
__MATCH=__K;
__DIJ=__D;
__CONT_N=__I;
END;
END;
END;
IF __MATCH^=. THEN DO;
__USED(__MATCH)='1';
OUTPUT;
END;
END;
END;
STOP;
DATA &OUT;
SET __MATCH;
SET __CONT POINT=__MATCH;
KEEP __IDCA __IDCO __CONT_N __DIJ __CA1-__CA&NVAR
__CO1-__CO&NVAR __d1-__d&nvar __absd1-__absd&nvar
__WT1-__WT&NVAR
__catime __cotime __dtime;
%if &time= %then %do;
__cotime=.; __catime=.;
%end;
LABEL
__catime="&time/CASE"
__cotime="&time/CONTROL"
__dtime="&time/ABS. DIFF"
__CONT_N='CONTROL/NUMBER'
__DIJ='DISTANCE/D_IJ'
%DO I=1 %TO &NVAR;
__CA&I="&&V&I/CASE"
__CO&I="&&V&I/CONTROL"
__absd&I="&&V&I/ABS. DIFF "
__d&I="&&V&I/DIFF "
__WT&I="&&V&I/WEIGHT"
%END;
;
%DO I=1 %TO &NVAR;
__d&i= (__CA&I-__CO&I); **raw diff;
__absd&I=abs(__CA&I-__CO&I); **abs diff;
__WT&I=&&W&I;
%END;
__dtime=__cotime-__catime;
PROC SORT DATA=&OUT; BY __IDCA __CONT_N;
proc sort data=__case; by __IDCA;
data &outnmca; merge __case
&out(in=__inout where=(__cont_n=1)); by __idca;
if __inout=0; **non-matches;
proc sort data=__cont; by __IDCO;
proc sort data=&out; by __IDCO;
data &outnmco; merge __cont
&out(in=__inout); by __idco;
if __inout=0; **non-matched controls;
proc sort data=&out; by __IDCA; **re-sort by case id;
%if %upcase(&print)=Y %then %do;
PROC PRINT data=&out LABEL SPLIT='/';
VAR __IDCA __IDCO __CONT_N
__DIJ
%DO I=1 %TO &NVAR;
__absd&I
%END;
%if &time^= %then %do;
__dtime
%end;
%DO I=1 %TO &NVAR;
__CA&I __CO&I
%END;
%if &time^= %then %do;
__catime __cotime
%end;
;
sum __dij;
title9'Data listing for matched cases and controls';
footnote"Greedy matching(gmatch) macro: data=&data
group=&group id=&id ";
footnote2" mvars=&mvars wts=&wts dmaxk=&dmaxk dmax=&dmax
ncontls=&ncontls";
footnote3" transf=&transf dist=&dist time=&time
seedca=&seedca seedco=&seedco";
footnote4" out=&out outnmca=&outnmca outnmco=&outnmco";
run;
title9'Summary data for matched cases and controls--one
obs/control';
%if &sysver ge 8 %then %do;
proc means data=&out maxdec=3 fw=8
n mean median min p10 p25 p75 p90 max sum;
%end;
%else %do;
proc means data=&out maxdec=3
n mean min max sum;
%end;
class __cont_n;
var __dij
%do I=1 %TO &NVAR;
__absd&I
%end;
%if &time^= %then %do;
__dtime
%end;
%do I=1 %TO &NVAR;
__ca&I
%end;
%if &time^= %then %do;
__catime
%end;
%do I=1 %TO &NVAR;
__co&I
%end;
%if &time^= %then %do;
__cotime
%end;
;
run;
*** estimate matching var means within matched sets for
controls;
proc means data=&out n mean noprint; by __idca;
var __dij
%do i=1 %to &nvar;
__co&i
%end;
__cotime
;
output out=_mcont n=n_co mean=__dijm
%do i=1 %to &nvar;
__com&i
%end;
__tcom
;
data _onecase; set &out; by __idca; if first.__idca;
data __camcon; merge _onecase _mcont; by __idca;
keep __idca n_co __dijm
__dtime __catime __tcom
%do i=1 %to &nvar;
__ca&i __com&i __actd&i __absd&i
%end;
;
%do i=1 %to &nvar;
__absd&i=abs(__ca&i - __com&i);
__actd&i=(__ca&i - __com&i);
%end;
__dtime=__tcom-__catime
;
label
n_co="No./CONTROLS"
__dijm="Average/Dij"
__dtime="&time/Mean Time DIFF"
__tcom="&time/Mean CONT TIME"
%do i=1 %to &nvar; %let vvar=%scan(&mvars,&i);
__absd&i="&vvar/Mean ABS. DIFF"
__com&i="&vvar/Mean CONTROL"
%end;
;
title9'Summary data for matched cases and controls--one
obs/case(using average control value)';
%if &sysver ge 8 %then %do;
proc means data=__camcon maxdec=3 fw=8
n mean median min p10 p25 p75 p90 max sum;
%end;
%else %do;
proc means data=__camcon maxdec=3
n mean min max sum;
%end;
var n_co __dijm
%do i=1 %to &nvar;
__absd&i
%end;
%if &time^= %then %do;
__dtime
%end;
%do i=1 %to &nvar;
__ca&i
%end;
%if &time^= %then %do;
__catime
%end;
%do i=1 %to &nvar;
__com&i
%end;
%if &time^= %then %do;
__tcom
%end;
;
%end; **end of print=y loop**;
%END; **end of bad2=0 loop**;
run;
title9; footnote;
run;
%mend greedy;
%IF &BAD=0 %THEN %DO;
%GREEDY
%END;
%MEND GMATCH;
**test data;
DATA FAKEREG;
DO I = 1 TO 3000;
*id = _n_;
id=i;
SEX = MOD(I, 2);
IF mod(I, 21) THEN CASE = 0; ELSE CASE = 1;
*AGE = (INT(RANUNI(12378937)*10000)/100);
age= int(ranuni(123789837)*100);
DROP i ;
OUTPUT;
END;
RUN;
data fakereg;set fakereg; timex=5; if case=0 then timex=6;
proc print data=fakereg;run;
%gmatch(data=fakereg,group=case, id=id,
mvars=age sex,wts=2 1,dmaxk= 5 0, transf=0,
time=timex, dist=1, ncontls=2,seedca=234098,seedco=0489,
out=regccout,outnmco=matched,print=Y);
run;
stephensed@gmail.com schrieb:
> I'm a relatively new SAS user, and trying to use a macro in SAS for the
> first time. The macro I'd like to use is the gmatch macro found at
> http://cancercenter.mayo.edu/mayo/research/biostat/upload/gmatch.sas,
> which I'd use to complete a 1:1 case-control match.
>
> The problem I encounter is that when I use the basic structure of the
> macro (with my specifications) as follows:
>
> %MACRO GMATCH(DATA=,GROUP=,ID=,
> MVARS=,WTS=,DMAXK=,DMAX=,DIST=1,
> NCONTLS=1, TIME=,TRANSF=0,
> SEEDCA=,SEEDCO=,PRINT=y,
> OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO);
> run;
>
> I get a warning from SAS, saying that I'm "missing %MEND statement",
> which prevents the completion of the program. So I tried adding
> "%mend;" to the end of the program in several ways (with "run;"
> following or preceding the "%mend;", with "end;" or "%end;" preceding
> the "%mend;", and just about all computations of these commands). Even
> when I include "%mend;" at the end of the program, I still get the same
> warning message that I'm "missing %MEND statement."
>
> Any idea what I'm doing wrong? Or suggestions how to run the program
> successfully?
>
> Best regards,
> Lara Heflin
|