Date: Tue, 18 Jul 2006 23:21:18 -0700 Oliver.Kuss@MEDIZIN.UNI-HALLE.DE "SAS(r) Discussion" Oliver.Kuss@MEDIZIN.UNI-HALLE.DE http://groups.google.com Re: gmatch matching macro help requested To: sas-l@uga.edu text/plain; charset="iso-8859-1"

Dear Lara, I downloaded the macro some time ago and the code below works perfectly with the example from the original authors.

Hope that helps, Oliver

** GMATCH Macro to match 1 or more controls for each of N cases using the GREEDY algorithm--REPLACES GREEDY option of MATCH macro. Changes: --cases and controls in same dataset --not mandatory to randomly pre-ort cases and controls, but recommended --options to transform X's and to choose distance metric --input parameters consistent with %DIST macro for optimal matching **;

/* Macro name: %gmatch

Authors: Jon Kosanke and Erik Bergstralh

Date: July 23, 2003 October 31, 2003...tweaked print/means based on "time" var

Macro function:

Matching using the GREEDY algorithm

The purpose of this macro is to match 1 or more controls(from a total of M) for each of N cases. The controls may be matched to the cases by one or more factors(X's). The control selected for a particular case(i) will be the control(j) closest to the case in terms of Dij. Dij can be defined in multiple ways. Common choices are the Euclidean distance and the weighted sum of the absolute differences between the case and control matching factors. I.e.,

Dij= SQRT [SUM { W.k*(X.ik-X.jk)**2} ], or

Dij= SUM { W.k*ABS(X.ik-X.jk) },

where the sum is over the number of matching factors X(with index k) and W.k = the weight assigned to matching factor k and X.ik = the value of variable X(k) for subject i.

The control(j) selected for a case(i) is the one with the smallest Dij (subject to constraints DMAX and DMAXK, defined below). In the case of ties, the first one encountered will be used. The higher the user-defined weight, the more likely it is that the case and control will be matched on the factor. Assign large weights (relative to the other weights) to obtain exact matches for two-level factors such as gender. An option to using weights might be to standarize the X's in some fashion. The macro has options to standardize all X's to mean 0 and variance 1 and to use ranks.

The matching algorithm used is the GREEDY method. Using the greedy method, once a match is made it is never broken. This may result in inefficiencies if a previously matched control would be a better match for the current case than those controls currently available. (An alternative method is to do optimal matching using the VMATCH & DIST macros. This method guarantees the best possible matched set in terms of minimizing the total Dij.) The GREEDY method generally produces very good matches, especially if the control pool is large relative to the number of cases. When multiple controls/case are desired, the algorithm first matches 1 control to all cases and then proceeds to select second controls.

The gmatch macro checks for missing values of matching variables and the time variable(if specified) and deletes those observations from the input dataset.

Call statement:

%gmatch(data=,group=,id=, mvars=,wts=,dmaxk=,dmax=,transf, time=, dist=, ncontls=,seedca=,seedco=, out=,outnmca=,outnmco=,print=);

Parameter definitions(R=required parameter):

R data SAS data set containing cases and potential controls. Must contain the ID, GROUP, and the matching variables.

R group SAS variable defining cases. Group=1 if case, 0 if control.

R id SAS CHARACTER ID variable for the cases and controls.

R mvars List of numeric matching variables common to both case and control data sets. For example, mvars=male age birthyr.

R wts List of non-negative weights corresponding to each matching variable. For example wts=10 2 1 corresponding to male, age and birthyr as in the above example.

dmaxk List of non-negative values corresponding to each matching variable. These numbers are the largest possible absolute differences compatible with a valid match. Cases will NOT be matched to a control if ANY of the INDIVIDUAL matching factor differences are >DMAXK. This optional parameter allows one to form matches of the type male+/-0, age+/-2, birth year+/-5 by specifying DMAXK=0 2 5.

dmax Largest value of Dij considered to be a valid match. If you want to match exactly on a two-level factor(such as gender coded as 0 or 1) then assign DMAX to be less than the weight for the factor. In the example above, one could use wt=10 for male and dmax=9. Leave DMAX blank if any Dij is a valid match. One would typically NOT use both DMAXK and DMAX. The only advantage to using both, would be to further restrict potential matches that meet the DMAXK criteria.

dist Indicates type of distance to calculate.

1=weighted sum(over matching vars) of absolute case-control differences(default)

2=weighted Euclidean distance

time Time variable used for risk set matching. Matches are only valid if the control time > case time. May need to

transf Indicates whether all matching vars are to be transformed (using the combined case+control data) prior to computing distances. 0=no(default), 1=standardize to mean 0 and variance 1, 2=use ranks of matching variables.

ncontls Indicates the number of controls to match to each case. The default is 1. With multiple controls per case, the algorithm will first match every case to one control and then again match each case to a second control, etc. Controls selected on the first pass will be stronger matches than those selected in later rounds. The output data set contains a variable (cont_n) which indicates on which round the control was selected.

seedca Seed value used to randomly sort the cases prior to matching. This positive integer will be used as input to the RANUNI function. The greedy matching algorithm is order dependent which, among other things means that cases matched first will be on average more similar to their controls than those matched last(as the number of control choices will be limited). If the matching order is related to confounding factors (possibly age or calendar time) then biases may result. Therefore it is generally considered good practice when using the GREEDY method to randomly sort both the cases and controls before beginning the matching process.

seedco Seed value used to randomly sort the controls prior to matching using the GREEDY method. This seed value must also be a positive integer.

print= Option to print data for matched cases. Use PRINT=y to print data and PRINT=n or blank to not print. Default is y.

out=name of SAS data set containing the results of the matching process. Unmatched cases are not included. See outnm below. The default name is __out. This data set will have the following layout:

Case_id Cont_id Cont_n Dij Delta_caco MVARS_ca MVARS_co 1 67 1 5.2 (Differences & actual 1 78 2 6.1 values for matching factors 2 52 1 2.9 for cases & controls) 2 92 2 3.1 . . . . . . . .

outnmca=name of SAS data set containing NON-matched cases. Default name is __nmca .

outnmco=name of SAS data set containing NON-matched controls. Default name is __nmco .

References: Bergstralh, EJ and Kosanke JL(1995). Computerized matching of controls. Section of Biostatistics Technical Report 56. Mayo Foundation.

Example: 1-1 matching by male(exact), age(+-2) and year(+-5). The wt for male is not relevant, as only exact matches on male will be considered. The weight for age(2) is double that for year(1).

%gmatch(data=all, group=ca_co,id=clinic, mvars=male age_od yr_od, wts=2 2 1, dmaxk=0 2 5,out=mtch, seedca=87877,seedco=987973);

This software is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. *************************************************************** */

%MACRO GMATCH(DATA=,GROUP=,ID=, MVARS=,WTS=,DMAXK=,DMAX=,DIST=1, NCONTLS=1, TIME=,TRANSF=0, SEEDCA=,SEEDCO=,PRINT=y, OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO);

%IF %LENGTH(&DATA)=0 %THEN %DO; %PUT ERROR: NO DATASET SUPPLIED; %LET BAD=1; %END;

%IF %LENGTH(&ID)=0 %THEN %DO; %PUT ERROR: NO ID VARIABLE SUPPLIED; %LET BAD=1; %END;

%IF %LENGTH(&GROUP)=0 %THEN %DO; %PUT ERROR: NO CASE(1)/CONTROL(0) GROUP VARIABLE SUPPLIED; %LET BAD=1; %END;

%IF %LENGTH(&MVARS)=0 %THEN %DO; %PUT ERROR: NO MATCHING VARIABLES SUPPLIED; %LET BAD=1; %END;

%IF %LENGTH(&WTS)=0 %THEN %DO; %PUT ERROR: NO WEIGHTS SUPPLIED; %LET BAD=1; %END;

%LET NVAR=0; %DO %UNTIL(%SCAN(&MVARS,&NVAR+1,' ')= ); %LET NVAR=%EVAL(&NVAR+1); %END; %LET NWTS=0; %DO %UNTIL(%QSCAN(&WTS,&NWTS+1,' ')= ); %LET NWTS=%EVAL(&NWTS+1); %END; %IF &NVAR^= &NWTS %THEN %DO; %PUT ERROR: #VARS MUST EQUAL #WTS; %LET BAD=1; %END;

%LET NK=0; %IF %QUOTE(&DMAXK)^= %THEN %DO %UNTIL(%QSCAN(&DMAXK,&NK+1,' ')= ); %LET NK=%EVAL(&NK+1); %END; %IF &NK>&NVAR %THEN %LET NK=&NVAR; %DO I=1 %TO &NVAR; %LET V&I=%SCAN(&MVARS,&I,' '); %END;

%IF &NWTS>0 %THEN %DO; DATA _NULL_; %DO I=1 %TO &NWTS; %LET W&I=%SCAN(&WTS,&I,' '); IF &&W&I<0 THEN DO; PUT 'ERROR: WEIGHTS MUST BE NON-NEGATIVE'; CALL SYMPUT('BAD','1'); END; %END; RUN; %END;

%IF &NK>0 %THEN %DO; DATA _NULL_; %DO I=1 %TO &NK; %LET K&I=%SCAN(&DMAXK,&I,' '); IF &&K&I<0 THEN DO; PUT 'ERROR: DMAXK VALUES MUST BE NON-NEGATIVE'; CALL SYMPUT('BAD','1'); END; %END; RUN; %END;

%MACRO MAX1; %IF &DMAX^= %THEN %DO; & __D<=&DMAX %END; %DO I=1 %TO &NK; & ABS(__CA&I-__CO&I)<=&&K&I %END; %MEND MAX1;

data __CHECK; set &DATA; __id=&id; if __id="" then delete; %DO I=1 %TO &NVAR; IF %scan(&mvars,&i)=. THEN DELETE; %END; %IF &TIME^= %THEN %DO; IF &TIME=. THEN DELETE; %END; run;

*** transform data if requested/separate cases & controls; %if &transf=1 %then %do; proc standard data=__check m=0 s=1 out=_stdzd; var &mvars; data _caco; set _stdzd; %end;

%if &transf=2 %then %do; proc rank data=__check out=_ranks; var &mvars; data _caco; set _ranks; %end;

%if &transf=0 %then %do; data _caco; set __check; %end;

DATA __CASE; SET _caco; if &group=1; DATA __CASE; SET __CASE END=EOF; KEEP __IDCA __CA1-__CA&NVAR __R &mvars %if &time^= %then %do; __catime %end; ; __IDCA=&ID; %if &time^= %then %do; __catime=&time; %end; %DO I=1 %TO &NVAR; __CA&I=&&V&I; %END; %if &seedca^= %then %do; SEED=&SEEDCA; __R=RANUNI( SEED ); %end; %else %do; __R=1; %end;

IF EOF THEN CALL SYMPUT('NCA',_N_); PROC SORT; BY __R __IDCA;

DATA __CONT; SET _caco; if &group=0; DATA __CONT; SET __CONT END=EOF; KEEP __IDCO __CO1-__CO&NVAR __R &mvars %if &time^= %then %do; __cotime %end; ; __IDCO=&ID; %if &time^= %then %do; __cotime=&time; %end; %DO I=1 %TO &NVAR; __CO&I=&&V&I; %END; %if &seedco^= %then %do; SEED=&SEEDCo; __R=RANUNI( SEED ); %end; %else %do; __R=1; %end;

IF EOF THEN CALL SYMPUT('NCO',_N_); RUN; %LET BAD2=0; %IF &NCO < %EVAL(&NCA*&NCONTLS) %THEN %DO; %PUT ERROR: NOT ENOUGH CONTROLS TO MAKE REQUESTED MATCHES; %LET BAD2=1; %END;

%IF &BAD2=0 %THEN %DO; PROC SORT; BY __R __IDCO; DATA __MATCH; KEEP __IDCA __CA1-__CA&NVAR __DIJ __MATCH __CONT_N %if &time^= %then %do; __catime __cotime %end; ; ARRAY __USED(&NCO) \$ 1 _TEMPORARY_; DO __I=1 TO &NCO; __USED(__I)='0'; END; DO __I=1 TO &NCONTLS; DO __J=1 TO &NCA; SET __CASE POINT=__J; __SMALL=.; __MATCH=.; DO __K=1 TO &NCO; IF __USED(__K)='0' THEN DO; SET __CONT POINT=__K;

%if &dist=2 %then %do; **wtd euclidian dist; __D= sqrt( %do k=1 %to &nvar; %scan(&wts,&k)*(__ca&k - __co&k)**2 %if &k<&nvar %then + ; %end; ); %end; %else %do; **wtd sum absolute diff; __D= %do k=1 %to &nvar; %scan(&wts,&k)*abs(__ca&k - __co&k ) %if &k<&nvar %then + ; %end; ; %end;

IF __d^=. & (__SMALL=. | __D<__SMALL) %MAX1 %if &time^= %then %do; & __cotime > __catime %end; THEN DO; __SMALL=__D; __MATCH=__K; __DIJ=__D; __CONT_N=__I; END; END; END; IF __MATCH^=. THEN DO; __USED(__MATCH)='1'; OUTPUT; END; END; END; STOP; DATA &OUT; SET __MATCH; SET __CONT POINT=__MATCH; KEEP __IDCA __IDCO __CONT_N __DIJ __CA1-__CA&NVAR __CO1-__CO&NVAR __d1-__d&nvar __absd1-__absd&nvar __WT1-__WT&NVAR __catime __cotime __dtime;

%if &time= %then %do; __cotime=.; __catime=.; %end; LABEL __catime="&time/CASE" __cotime="&time/CONTROL" __dtime="&time/ABS. DIFF" __CONT_N='CONTROL/NUMBER' __DIJ='DISTANCE/D_IJ' %DO I=1 %TO &NVAR; __CA&I="&&V&I/CASE" __CO&I="&&V&I/CONTROL" __absd&I="&&V&I/ABS. DIFF " __d&I="&&V&I/DIFF " __WT&I="&&V&I/WEIGHT" %END; ; %DO I=1 %TO &NVAR; __d&i= (__CA&I-__CO&I); **raw diff; __absd&I=abs(__CA&I-__CO&I); **abs diff; __WT&I=&&W&I; %END; __dtime=__cotime-__catime;

PROC SORT DATA=&OUT; BY __IDCA __CONT_N; proc sort data=__case; by __IDCA; data &outnmca; merge __case &out(in=__inout where=(__cont_n=1)); by __idca; if __inout=0; **non-matches;

proc sort data=__cont; by __IDCO; proc sort data=&out; by __IDCO; data &outnmco; merge __cont &out(in=__inout); by __idco; if __inout=0; **non-matched controls; proc sort data=&out; by __IDCA; **re-sort by case id;

%if %upcase(&print)=Y %then %do; PROC PRINT data=&out LABEL SPLIT='/'; VAR __IDCA __IDCO __CONT_N

__DIJ %DO I=1 %TO &NVAR; __absd&I %END; %if &time^= %then %do; __dtime %end; %DO I=1 %TO &NVAR; __CA&I __CO&I %END; %if &time^= %then %do; __catime __cotime %end; ; sum __dij;

title9'Data listing for matched cases and controls'; footnote"Greedy matching(gmatch) macro: data=&data group=&group id=&id "; footnote2" mvars=&mvars wts=&wts dmaxk=&dmaxk dmax=&dmax ncontls=&ncontls"; footnote3" transf=&transf dist=&dist time=&time seedca=&seedca seedco=&seedco"; footnote4" out=&out outnmca=&outnmca outnmco=&outnmco"; run; title9'Summary data for matched cases and controls--one obs/control'; %if &sysver ge 8 %then %do; proc means data=&out maxdec=3 fw=8 n mean median min p10 p25 p75 p90 max sum; %end; %else %do; proc means data=&out maxdec=3 n mean min max sum; %end; class __cont_n; var __dij

%do I=1 %TO &NVAR; __absd&I %end; %if &time^= %then %do; __dtime %end; %do I=1 %TO &NVAR; __ca&I %end; %if &time^= %then %do; __catime %end; %do I=1 %TO &NVAR; __co&I %end; %if &time^= %then %do; __cotime %end; ; run; *** estimate matching var means within matched sets for controls; proc means data=&out n mean noprint; by __idca; var __dij %do i=1 %to &nvar; __co&i %end; __cotime ; output out=_mcont n=n_co mean=__dijm %do i=1 %to &nvar; __com&i %end; __tcom ; data _onecase; set &out; by __idca; if first.__idca; data __camcon; merge _onecase _mcont; by __idca;

keep __idca n_co __dijm __dtime __catime __tcom %do i=1 %to &nvar; __ca&i __com&i __actd&i __absd&i %end; ;

%do i=1 %to &nvar; __absd&i=abs(__ca&i - __com&i); __actd&i=(__ca&i - __com&i); %end; __dtime=__tcom-__catime ;

label n_co="No./CONTROLS" __dijm="Average/Dij" __dtime="&time/Mean Time DIFF" __tcom="&time/Mean CONT TIME"

%do i=1 %to &nvar; %let vvar=%scan(&mvars,&i); __absd&i="&vvar/Mean ABS. DIFF" __com&i="&vvar/Mean CONTROL" %end; ; title9'Summary data for matched cases and controls--one obs/case(using average control value)'; %if &sysver ge 8 %then %do; proc means data=__camcon maxdec=3 fw=8 n mean median min p10 p25 p75 p90 max sum; %end; %else %do; proc means data=__camcon maxdec=3 n mean min max sum; %end; var n_co __dijm %do i=1 %to &nvar; __absd&i %end; %if &time^= %then %do; __dtime %end; %do i=1 %to &nvar; __ca&i %end; %if &time^= %then %do; __catime %end; %do i=1 %to &nvar; __com&i %end; %if &time^= %then %do; __tcom %end; ; %end; **end of print=y loop**; %END; **end of bad2=0 loop**; run; title9; footnote; run;

%mend greedy;

%IF &BAD=0 %THEN %DO; %GREEDY %END; %MEND GMATCH;

**test data; DATA FAKEREG; DO I = 1 TO 3000; *id = _n_; id=i; SEX = MOD(I, 2); IF mod(I, 21) THEN CASE = 0; ELSE CASE = 1; *AGE = (INT(RANUNI(12378937)*10000)/100); age= int(ranuni(123789837)*100); DROP i ; OUTPUT; END;

RUN; data fakereg;set fakereg; timex=5; if case=0 then timex=6; proc print data=fakereg;run;

%gmatch(data=fakereg,group=case, id=id, mvars=age sex,wts=2 1,dmaxk= 5 0, transf=0, time=timex, dist=1, ncontls=2,seedca=234098,seedco=0489, out=regccout,outnmco=matched,print=Y); run;

stephensed@gmail.com schrieb:

> I'm a relatively new SAS user, and trying to use a macro in SAS for the > first time. The macro I'd like to use is the gmatch macro found at > http://cancercenter.mayo.edu/mayo/research/biostat/upload/gmatch.sas, > which I'd use to complete a 1:1 case-control match. > > The problem I encounter is that when I use the basic structure of the > macro (with my specifications) as follows: > > %MACRO GMATCH(DATA=,GROUP=,ID=, > MVARS=,WTS=,DMAXK=,DMAX=,DIST=1, > NCONTLS=1, TIME=,TRANSF=0, > SEEDCA=,SEEDCO=,PRINT=y, > OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO); > run; > > I get a warning from SAS, saying that I'm "missing %MEND statement", > which prevents the completion of the program. So I tried adding > "%mend;" to the end of the program in several ways (with "run;" > following or preceding the "%mend;", with "end;" or "%end;" preceding > the "%mend;", and just about all computations of these commands). Even > when I include "%mend;" at the end of the program, I still get the same > warning message that I'm "missing %MEND statement." > > Any idea what I'm doing wrong? Or suggestions how to run the program > successfully? > > Best regards, > Lara Heflin

Back to: Top of message | Previous page | Main SAS-L page