Date: Wed, 22 Aug 2007 14:04:44 -0500
Reply-To: Tom White <tw2@MAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Tom White <tw2@MAIL.COM>
Subject: SAS data step compute COUNTs and Percentages
Content-Type: text/plain; charset="iso-8859-1"
Hello SAS-L,
I have a data set containing about 7M records (and about 30 more fields besides the two fields, ID & CODE) I show here.
ID is one of the fields in this data set. It stands for a patient ID.
It is a character field of length 1.
CODE is another 5-digit character field. It stands for a medical procedure.
This field takes on thousands of values. There is, however, a subset of these
values I am interested in. This subset contains the CODE values
NULL 59510 J2505 59400 J1745 J3490 27447
A0431 FACIL J9355 J9310 E0784
Here is sample dataset containing 34 records.
ID CODE
1 J3490
1 NULL
1 36415
1 88142
1 59510
2 99213
2 80053
2 J2505
3 FACIL
3 J3490
4 99232
5 80053
5 E0784
5 E0784
5 J9310
6 A0431
7 59400
7 59400
8 NULL
8 NULL
8 NULL
8 NULL
9 27447
9 J9355
10 E0784
10 NULL
10 E0784
10 NULL
10 J2505
10 59510
10 NULL
10 NULL
10 E0784
10 59510
I would like to create additional fields to augment the above data set for modeling.
*****************************************************************************************************
I would like to say in words what I am trying to compute.
Maybe you can find an easier way to get the results I show below.
For each patient (ID) in my sample dataset, create new data columnns to compute
the total number of medical services (CODE) they had within any of the top medical services (CODEs)
shown below:
NULL 59510 J2505 59400 J1745 J3490 27447
A0431 FACIL J9355 J9310 E0784
The above CODEs are the medical procedures, out of many thousands of medical procedures,
I have determined to be of interest to me.
Therefore I would like to know, given the thousands of procedures, and given my 12 proceduers shwon above,
the total number of procedures (CODEs) a patientb (ID) had within any of the 12 proceduers shown above.
I would like to create counts and percentages.
*****************************************************************************************************
(1) I would like to know, for each ID, how many CODEs are found in
the subset of codes mentioned above (I repeat them here):
NULL 59510 J2505 59400 J1745 J3490 27447
A0431 FACIL J9355 J9310 E0784
For example, create a new field, call it, say, SUM_COUNT, to count
the total number of codes per ID:
ID CODE SUM_COUNT
1 J3490 1
1 NULL 1
1 36415 0
1 88142 0
1 59510 1
2 99213 0
2 80053 0
2 J2505 1
3 FACIL 1
3 J3490 1
4 99232 0
5 80053 0
5 E0784 2
5 E0784 2
5 J9310 1
6 A0431 1
7 59400 2
7 59400 2
8 NULL 4
8 NULL 4
8 NULL 4
8 NULL 4
9 27447 1
9 J9355 1
10 E0784 3
10 NULL 4
10 E0784 3
10 NULL 4
10 J2505 1
10 59510 2
10 NULL 4
10 NULL 4
10 E0784 3
10 59510 2
(2) Now that I have counts, I would like to compute WITHIN ID percentages.
For example, within ID=10, we have 10 records. Of those 10 records, we have:
NULL=4 records or 4/10=40%
E0784=3 records or 3/10=30%
J2505=1 records or 1/10=10%
59510=2 records or 2/10=20%
My data set now will look like:
ID CODE SUM_COUNT WITHIN_PCT
1 J3490 1 0.20
1 NULL 1 0.20
1 36415 0 0.00
1 88142 0 0.00
1 59510 1 0.20
2 99213 0 0.00
2 80053 0 0.00
2 J2505 1 0.33
3 FACIL 1 0.50
3 J3490 1 0.50
4 99232 0 0.00
5 80053 0 0.00
5 E0784 2 0.50
5 E0784 2 0.50
5 J9310 1 0.25
6 A0431 1 1.00
7 59400 2 1.00
7 59400 2 1.00
8 NULL 4 1.00
8 NULL 4 1.00
8 NULL 4 1.00
8 NULL 4 1.00
9 27447 1 0.50
9 J9355 1 0.50
10 E0784 3 0.30
10 NULL 4 0.40
10 E0784 3 0.30
10 NULL 4 0.40
10 J2505 1 0.10
10 59510 2 0.20
10 NULL 4 0.40
10 NULL 4 0.40
10 E0784 3 0.30
10 59510 2 0.20
(3) Finally, I woul like to have an OVERALL percentage.
For example, lets take ID=1. (We have a TOTAL of 34 records.)
I would like a new field, call it, OVERALL_PCT.
For ID=1 and CODE='J3490' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
For ID=1 and CODE='NULL' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
For ID=1 and CODE='36415' we will have SUM_COUNT/34 = 0/34 = 0.00000 (or 0.00%).
For ID=1 and CODE='88142' we will have SUM_COUNT/34 = 0/34 = 0.00000 (or 0.0%).
For ID=1 and CODE='59510' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
ID CODE SUM_COUNT OVERALL_PCT
1 J3490 1 0.02941
1 NULL 1 0.02941
1 36415 0 0.00000
1 88142 0 0.00000
1 59510 1 0.02941
Therefore, my final desired data set will look like:
ID CODE SUM_COUNT WITHIN_PCT OVERALL_PCT
1 J3490 1 0.20 0.02941
1 NULL 1 0.20 0.02941
1 36415 0 0.00 0.00000
1 88142 0 0.00 0.00000
1 59510 1 0.20 0.02941
2 99213 0 0.00 0.00000
2 80053 0 0.00 0.00000
2 J2505 1 0.33 0.02941
3 FACIL 1 0.50 0.02941
3 J3490 1 0.50 0.02941
4 99232 0 0.00 0.00000
5 80053 0 0.00 0.00000
5 E0784 2 0.50 0.05882
5 E0784 2 0.50 0.05882
5 J9310 1 0.25 0.02941
6 A0431 1 1.00 0.02941
7 59400 2 1.00 0.05882
7 59400 2 1.00 0.05882
8 NULL 4 1.00 0.12500
8 NULL 4 1.00 0.12500
8 NULL 4 1.00 0.12500
8 NULL 4 1.00 0.12500
9 27447 1 0.50 0.02941
9 J9355 1 0.50 0.02941
10 E0784 3 0.30 0.08823
10 NULL 4 0.40 0.11765
10 E0784 3 0.30 0.08823
10 NULL 4 0.40 0.11765
10 J2505 1 0.10 0.02941
10 59510 2 0.20 0.05882
10 NULL 4 0.40 0.11765
10 NULL 4 0.40 0.11765
10 E0784 3 0.30 0.08823
10 59510 2 0.20 0.05882
Thank so much!!!
tom
--
We've Got Your Name at http://www.mail.com !
Get a FREE E-mail Account Today - Choose From 100+ Domains