Date: Tue, 15 Dec 2009 12:55:09 -0500
Reply-To: Jonathan Goldberg <jgoldberg@BIOMEDSYS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jonathan Goldberg <jgoldberg@BIOMEDSYS.COM>
Subject: Re: Creating Dummy Variables from Values of Variables
(reformatted to be more readable,
I HATE Outlook's formatting of my messages)
This task can be done in one pass of the patient data by a data step using
a hash table and an array. Also needed: a preprocess of the the TopX data
set to add an index.
data TopX;
set Topx(keep = code);
index + 1;
run;
proc sql noprint;
select cats("dummy" code) into :codes separated by " " from topx;
quit;
data want;
array codes(*) &codes;
array dxs(*) dx1 - dx9; /*you could use macro here to allow*/
/*for a varying number of dxs*/
if _n_ = 1 then do;
if 0 then set TopX;
dcl hash codetable(dataset: "TopX");
codetable.definekey("code");
codetable.definedata("index");
codetable.definedone();
end;
do i = 1 to dim(codes); /Initialize dummys to zero*/
code(i) = 0;
end;
set have;
do i = 1 = dim(dxs);
code = dxs(i);
rc = codetable.find(); /*found a top whatever dx?*/
if rc = 0 then codes(index) = 1; /*set it to 1*/
end;
run;
The proc sql step is not needed for a single set of codes; you could just
hand code them into the array. The codes just need to be in the same
order they are in the data set TopX so the index variable points to the
right array member.
Cheers,
Jonathan
On Mon, 14 Dec 2009 23:32:20 -0500, Kirby, Ted <ted.kirby@LEWIN.COM> wrote:
>We have a dataset that has up to nine ICD-9 diagnosis codes in each
record. We want to identify the top 50 (or 100 or 250) ICD-9 and
construct dummy variables for the existence of those codes.
>
>Thanks to Dan Nordlund's reply to Trish Bous (at
http://groups.google.com/group/comp.soft-
sys.sas/browse_thread/thread/c82586e7f1164e54/ec5fc63a3eb16d9b?
lnk=gst&q=proc+freq+order) we have the first part done (the identification
of the top x codes and have saved them in a dataset. The first four
observations of this dataset can be created with:
>
>data TopX;
>input code $5. count;
>datalines;
>5990 36163
>4019 34429
>2639 31573
>5188126200
>;
>run;
>
>How do we create the appropriate dummy variables? We would like to add
variables to the dataset named dummyxxxxx for each of the top diagnosis
codes that would have a value of 1 if the appropriate ICD-9 Diagnosis Code
appears in any of the diagnosis code fields. We would use the TopX
dataset to construct the names of the dummy variables, but setting each of
the dummy variables would have to be done in the main dataset. For
example, in the following dataset:
>
>data have;
>input patid (Dx1-Dx9) ($5);
>datalines;
> 1 51881 5990 8052 85201 V550 V5419 73300 57420 5932
> 2 49121 51881 486 V850 41401 4019 7837 V462
> 3 51884 33520 34400 5070 5601 5990 V4611 33522 51909
> 4 03811 2639 2793 73021 V090 99591 07070 2384 311
> 5 56961 486 1962 4019 2749 27801 3051 V1005 78606
> 6 99832 73026 V4586 04111 496 42731 27800 V1259 71590
>;
>run;
>
>We would want to add the dummy variables with the following values:
>
>PatID Dummy5990 Dummy4019 Dummy2639 Dummy51881
> 1 1 0 0 1
> 2 1 0 0 1
> 3 1 0 0 0
> 4 0 0 1 0
> 5 0 1 0 0
> 6 0 0 0 0
>
>Thanks for any assistance you can render.
>
>************* IMPORTANT - PLEASE READ ********************
>
>This e-mail, including attachments, may include confidential and/or
proprietary information, and may be used only by the person or entity to
which it is addressed. If the reader of this e-mail is not the intended
recipient or his or her authorized agent, the reader is hereby notified
that any dissemination, distribution or copying of this e-mail is
prohibited. If you have received this e-mail in error, please notify the
sender by replying to this message and delete this e-mail immediately.
>