Date: Mon, 30 Apr 2001 10:54:08 +0000
Reply-To: "Dr. Hans-Christian Waldmann" <waldmann@SAMSON.FIRE.UNI-BREMEN.DE>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dr. Hans-Christian Waldmann" <waldmann@SAMSON.FIRE.UNI-BREMEN.DE>
Organization: University of Bremen
I would like to invite opinions pertaining to the following problem:
We are trying to evaluate some treatment using a standard pre-post
design with multiple control groups (4 groups total). Upon arrival
patients are randomized into these groups. All's fine. But patients
have been assigned different diagnoses (while being eligible for the
same treatment) and we need to control for possible treatment*diagnoses
interactions. A solution would be matching the patients after
randomization with respect to diagnoses, without any repartitioning
of the formerly selected groups. I have written a SAS macro to this
end that implements the following algorithm. What I would like to
know: does this corrupt the idea of control by randomization, or is
this sort of two-stage sampling a sound strategy?
A master data set has several variables, among these:
- ID (of patient)
- Match (diagnosis, numeric for convenience)
Take all of the N subjects from the master data set, and allot them
into G groups of equal size k = N/G using the usual
sort-by-random-uniform(0)-and-take-first-k algorithm. Delete those
assigned from the pool and repeat until all groups have been filled.
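For illustration, here is a minimal SAS sketch of this first step.
The data set and variable names (MASTER, POOL, GROUPS, MATCH) are
placeholders of mine, not the actual macro's:

%let g = 4;                      /* number of groups G            */
%let k = 156;                    /* group size k = N/G            */

data pool;
   set master;                   /* assumed master data set       */
   u = ranuni(0);                /* uniform random sort key       */
run;

proc sort data=pool;
   by u;
run;

data groups;
   set pool;
   group = ceil(_n_ / &k);       /* first k obs -> group 1, etc.  */
   if group <= &g;               /* drop any remainder beyond G*k */
   drop u;
run;

Sorting once by ranuni(0) and cutting the sorted set into consecutive
blocks of k is equivalent to repeatedly taking the first k and
deleting them from the pool.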
Now, for each group, count the occurrence of each value of the matching
variable. From the resulting array group*value (cell = count of the value),
select the minimum frequency into an auxiliary variable.
Reduce all groups by random case deletion to have this least common count
for the particular value of the matching variable. Do it for all values of
the matching variable, and put the pieces together (append the set for
each value). Now we have G groups with exactly the same distribution of
the matching variable within groups and equal sample size across groups;
append these to give the final set.
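Continuing the sketch with the same assumed names (GROUPS from step 1;
BALANCED and LEFTOVER are mine), step 2 might look like this:

/* count each value of MATCH per group */
proc freq data=groups;
   tables group*match / noprint out=counts(keep=group match count);
run;

/* least common count per MATCH value across the groups */
proc means data=counts noprint nway;
   class match;
   var count;
   output out=mins(keep=match mincount) min=mincount;
run;

/* attach the minimum and a random key for case deletion */
proc sort data=groups;
   by match;
run;
data prep;
   merge groups mins;
   by match;
   u = ranuni(0);
run;

/* within each group*match cell, keep a random mincount cases */
proc sort data=prep;
   by group match u;
run;
data balanced leftover;
   set prep;
   by group match;
   if first.match then seq = 0;
   seq + 1;                                  /* running count in cell  */
   if seq <= mincount then output balanced;  /* piece of the final set */
   else output leftover;                     /* pool for the next pass */
   drop u seq mincount;
run;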
Even if the master data set had been completely balanced with respect
to the matching variable, randomization is almost sure to imbalance
the values of the matching variable within the newly sampled groups.
Since we deleted some patients from the G-1 groups having more patients
with a particular value of the matching variable in the previous step,
there will be some loss from the original master data set to the final
one. The idea is to put this set of patients into another "input" set
and iterate the whole procedure. The result set of the second loop is
appended to the first. If the loss does not exceed a specified proportion,
or if the variance of the matching variable is too poor to build interim
data sets in step 2, quit; else iterate again.
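The quit-or-iterate decision might be sketched as follows; N, MAXLOSS,
and LEFTOVER are my placeholders, and the variance check on the
matching variable is omitted:

%let n = 624;                    /* size of the master data set   */
%let maxloss = 0.10;             /* assumed tolerable loss        */

proc sql noprint;
   select count(*) into :nleft   /* cases not yet placed          */
   from leftover;
quit;

%macro decide;
   %if %sysevalf(&nleft / &n <= &maxloss) %then
      %put NOTE: loss within tolerance -> quit;
   %else
      %put NOTE: &nleft of &n cases left -> iterate on LEFTOVER;
%mend;
%decide;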
I have run this with a master data set of 624 patients, to be split
into G=4 groups of k=156 each in the first run. The matching variable
had 4 distinct values.
The first result set came up with 544 out of 624 patients, with
136 patients in each group and the matching variable balanced.
The procedure then iterated with the 80 patients not in the result
set, randomized, matched, and augmented the final set with another
52 people (13 persons holding each of the 4 values of the
matching variable). So we got 596 out of 624, randomized with
respect to anything unknown and matched with respect to the
matching variable.
Lots of other test runs yielded comparable results, and the final
sets passed all tests (duplicate IDs? same group sizes? same
counts for the matching variable? etc.)
So I know it works. But is it right (conceptually)? Or, if it is,
is it trivial (in the sense that it's unnecessary or could be done
more simply)?
Any comments / hints / criticisms welcome!
PD Dr. Hans C Waldmann
Methodology & Applied Statistics in Psychology & the Health Sciences
ZFRF / University of Bremen / Grazer Str 6 / 28359 Bremen / Germany
friend of: AIX PERL POSTGRES ADABAS SAS TEX