Date: Fri, 18 Dec 1998 11:54:27 -0500
Reply-To: pdorfma@FL6612MAILEX4.UCS.ATT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: pdorfma@FL6612MAILEX4.UCS.ATT.COM
Subject: Re: Sort variables within an observation
Peter Flom <peter.flom@NDRI.ORG>, in part, wrote:
>I have a data set with several hundred observations. Each observation
>contains (among a lot of other stuff) 5 variable corresponding to the age
>at which the subject first did something. Each of these could be missing,
>or range from 1 to 25.
>The "somethings" are various drugs: Marijuana, cocaine, etc.
>What I would like is, for each person, to get a data set containing which
>drug the person did first, second, third, fourth, or fifth. There are a
>couple complications. Each person could have done any, some, or all of
>the drugs. They could do them in any order. And they could have started
>doing two (or more) at the same age, thus yielding ties.
>I could code this with "brute force", but that would take a couple hundred
>lines of code.
>Does anyone have a simple or elegant solution?
Peter,
A very similar problem was once discussed in the thread "Ordering word
tokens" originated by a question posted by Robert Lokhamp on October 10,
1998. It was shown that, although the task can be recast to make use of
PROC SORT, the most efficient solution boils down to exactly what you
stated in the title: using an explicitly coded sorting routine to order
the variables within every observation. With only 5 variables to sort,
there is no need for a sophisticated algorithm; a simple scheme such as
straight insertion sort will run just as fast. First, you have to define
an array, say, D(*), comprising the variables you need to sort; second,
use insertion sort to order them. However, your problem has an
additional twist in that you need to output 5 different variables holding
the NAMES of the variables whose values have been enumerated as a result of
sorting. Therefore, in addition to the first array, we shall create two
extra arrays. The first, call it Z(5) _TEMPORARY_, will hold the index
numbers 1 through 5 of the variables in D(*). During sorting, whenever an
item of D moves, the corresponding item of Z moves with it, so the two
arrays stay in step. The second extra array, S(5), will house the 5 new
variables to be populated with the variable names from D(*) in order of
first drug usage. After the sort, the rearranged index numbers in Z act as
pointers telling us exactly which elements of D(*) the names should come
from.
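For comparison, the PROC SORT route mentioned above might look roughly like
the sketch below. It assumes a data set DRUGS with the five age variables
MARI COCA HERO LSD OPIUM, one observation per person; the names LONG, WIDE,
ID, DRUG, AGE and the prefix ORD are made up for illustration.

```sas
*** Long form: one row per person and drug;
DATA LONG (KEEP=ID DRUG AGE);
   SET DRUGS;
   ARRAY D(*) MARI COCA HERO LSD OPIUM;
   LENGTH DRUG $8;
   ID = _N_;                    *** hypothetical person identifier;
   DO I=1 TO DIM(D);
      CALL VNAME(D(I),DRUG);    *** name of the I-th drug variable;
      AGE = D(I);
      OUTPUT;
   END;
RUN;

*** Order the drugs by age within each person;
PROC SORT DATA=LONG;
   BY ID AGE;
RUN;

*** Back to wide form: ORD1-ORD5 hold the drug names in sorted order;
PROC TRANSPOSE DATA=LONG OUT=WIDE (DROP=_NAME_) PREFIX=ORD;
   BY ID;
   VAR DRUG;
RUN;
```

That is three steps where the array approach needs a single DATA step.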
Assume, for simplicity, that we have only 10 observations with some drug
data in the range as you indicated, and some extra variables standing for "a
lot of other stuff". The situation could be simulated using the following
DATA step:
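DATA DRUGS (DROP=I J);
   ARRAY D (*) MARI COCA HERO LSD OPIUM;
   DO I=1 TO 10;
      *** Simulate the age at first use for each drug;
      DO J=LBOUND(D) TO HBOUND(D);
         D(J) = INT(RANUNI(1)*25) - 1;
         *** Non-positive values stand for a drug never used;
         IF D(J) LE 0 THEN D(J) = .;
      END;
      *** Stand-ins for the other variables;
      OTHER = CEIL(RANUNI(2)*10);
      STUFF = CEIL(RANUNI(3)*10);
      OUTPUT;
   END;
RUN;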
Printed, the dataset looks like this:
OBS   MARI   COCA   HERO    LSD  OPIUM  OTHER  STUFF
  1      3     23      8      5     22     10      6
  2     12      .      .     19     12      9      1
  3     22      6      5     16     23      3      7
  4      9     12      6     10     20      7      6
  5     13      8     17     11     22     10      6
  6      6      8     10     15      3      2      9
  7      6     22     21     13      .      2      6
  8      9      3     15      9      2      5      2
  9     13     17      9      .     12      4      1
 10     16     22     10     22     16      2      2
Now, we can translate the plan outlined above into the SAS Language:
DATA USAGE (KEEP=FIRST SECOND THIRD FOURTH FIFTH);
   ARRAY D(*) MARI COCA HERO LSD OPIUM;
   ARRAY Z(5) _TEMPORARY_;
   ARRAY S(*) $8 FIRST SECOND THIRD FOURTH FIFTH;
   SET DRUGS;
   *** Enumerate variables in D(*);
   DO I=1 TO DIM(Z); Z(I) = I; END;
   *** Insertion-sort D(*) and move Z-items along;
   DO J=LBOUND(D)+1 TO HBOUND(D);
      TD = D(J); TN = Z(J);
      DO I=J-1 TO 1 BY -1;
         IF TD >= D(I) THEN LEAVE;
         D(I+1) = D(I); Z(I+1) = Z(I);
      END;
      D(I+1) = TD; Z(I+1) = TN;
   END;
   *** Use Z-items as pointers to assign names;
   DO I=1 TO 5;
      N = Z(I);
      CALL VNAME(D(N),S(I));
   END;
RUN;
Which yields:
OBS  FIRST   SECOND  THIRD   FOURTH  FIFTH
  1  MARI    LSD     HERO    OPIUM   COCA
  2  COCA    HERO    MARI    OPIUM   LSD
  3  HERO    COCA    LSD     MARI    OPIUM
  4  HERO    MARI    LSD     COCA    OPIUM
  5  COCA    LSD     MARI    HERO    OPIUM
  6  OPIUM   MARI    COCA    HERO    LSD
  7  OPIUM   MARI    LSD     HERO    COCA
  8  OPIUM   COCA    MARI    LSD     HERO
  9  LSD     HERO    OPIUM   MARI    COCA
 10  HERO    MARI    OPIUM   COCA    LSD
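Note that SAS orders a missing value below any number, which is why
observation 2 lists COCA and HERO first even though those ages are missing.
If drugs never used should instead fall to the end and be left blank, one
possible variation is sketched below. The data set name USAGE2 and the
sentinel value 999 are assumptions made for illustration; any value larger
than every real age would do.

```sas
DATA USAGE2 (KEEP=FIRST SECOND THIRD FOURTH FIFTH);
   ARRAY D(*) MARI COCA HERO LSD OPIUM;
   ARRAY Z(5) _TEMPORARY_;
   ARRAY S(*) $8 FIRST SECOND THIRD FOURTH FIFTH;
   SET DRUGS;
   DO I=1 TO DIM(Z);
      Z(I) = I;
      *** Assumed sentinel: push missing ages to the high end;
      IF D(I) = . THEN D(I) = 999;
   END;
   *** Same insertion sort as before, moving Z-items along;
   DO J=LBOUND(D)+1 TO HBOUND(D);
      TD = D(J); TN = Z(J);
      DO I=J-1 TO 1 BY -1;
         IF TD >= D(I) THEN LEAVE;
         D(I+1) = D(I); Z(I+1) = Z(I);
      END;
      D(I+1) = TD; Z(I+1) = TN;
   END;
   *** Name only the positions that hold a real age;
   DO I=1 TO 5;
      N = Z(I);
      IF D(I) < 999 THEN CALL VNAME(D(N),S(I));
      ELSE S(I) = ' ';
   END;
RUN;
```

With this change, observation 2 would come out as MARI, OPIUM, LSD followed
by two blanks. Ties still keep the original left-to-right order of the
variables, as in the step above.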
One friendly warning: DO NOT try to shorten the code by using a nested array
reference D(Z(I)) inside the VNAME routine unless you really want to spend a
day figuring out why you get "Array subscript out of range at line..." when
the subscript is absolutely, positively within range.
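One way to sidestep the VNAME call altogether, if you do not mind typing
the names once more, is to keep them in a temporary character array and
index it with the Z-items. This is only a sketch: the array name NM and the
data set name USAGE3 are made up, and the names in NM must be listed in the
same order as the variables in D(*).

```sas
DATA USAGE3 (KEEP=FIRST SECOND THIRD FOURTH FIFTH);
   ARRAY D(*) MARI COCA HERO LSD OPIUM;
   ARRAY Z(5) _TEMPORARY_;
   *** Hypothetical lookup table, same order as D(*);
   ARRAY NM(5) $8 _TEMPORARY_ ('MARI' 'COCA' 'HERO' 'LSD' 'OPIUM');
   ARRAY S(*) $8 FIRST SECOND THIRD FOURTH FIFTH;
   SET DRUGS;
   DO I=1 TO DIM(Z); Z(I) = I; END;
   *** Insertion-sort D(*) and move Z-items along;
   DO J=LBOUND(D)+1 TO HBOUND(D);
      TD = D(J); TN = Z(J);
      DO I=J-1 TO 1 BY -1;
         IF TD >= D(I) THEN LEAVE;
         D(I+1) = D(I); Z(I+1) = Z(I);
      END;
      D(I+1) = TD; Z(I+1) = TN;
   END;
   *** Plain assignment: look the name up by the pointer;
   DO I=1 TO DIM(S);
      S(I) = NM(Z(I));
   END;
RUN;
```

Here the nested reference NM(Z(I)) sits in an ordinary assignment rather
than in a CALL VNAME argument.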
Have a happy holiday season!
Kind regards,
Paul
++++++++++++++++++++++++
Paul M. Dorfman
Citibank UCS
Decision Support Systems
Jacksonville, FL
++++++++++++++++++++++++