Date: Wed, 22 Aug 2007 16:15:42 -0500
Reply-To: Mary <mlhoward@avalon.net>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mary <mlhoward@AVALON.NET>
Subject: Re: SAS data step compute COUNTs and Percentages
Content-Type: text/plain; charset="iso-8859-1"
Hi Tom,
I've played around with some code that would get your data in the format that you want it- before percentages. Here it is below.
Mary Howard
Research Assistant III
Dept. of Ophthalmology
Univ. of Iowa Hospitals and Clinics
data set1;
informat diag $10.;
input id diag $;
cards;
1 heart
1 heart
1 diabetes
2 eye
3 hip
3 hip
4 diabetes
4 diabetes
4 diabetes
;
run;
proc print data=set1;
run;
proc sql;
create table set1a as
select distinct id from
set1;
data set2a;
informat diag $10.;
set set1a;
diag='heart';
run;
data set2b;
informat diag $10.;
set set1a;
diag='diabetes';
run;
data set2c;
informat diag $10.;
set set1a;
diag='eye';
run;
data set2;
set set2a set2b set2c;
run;
proc print data=set2 noobs;
var id diag;
run;
proc sql;
create table set3 as
select id, diag, count(diag) as count
from set1
group by id,diag
order by id, diag;
quit;
run;
proc print data=set3 noobs;
run;
proc sort data=set2;
by id diag;
proc sort data=set3;
by id diag;
proc sql;
create table set4 as
select set2.id as id, set2.diag as diag, set3.count as count
from set2
left outer join set3
on set2.id=set3.id and
set2.diag=set3.diag
order by set2.id,set2.diag;
quit;
run;
data set4a;
set set4;
if count=. then count=0;
proc print data=set4a noobs;
var id diag count;
run;
----- Original Message -----
From: "Tom White" <tw2@MAIL.COM>
To: <SAS-L@LISTSERV.UGA.EDU>
Sent: Wednesday, August 22, 2007 2:04 PM
Subject: SAS data step compute COUNTs and Percentages
> Hello SAS-L,
>
> I have a data set containing about 7M records (and about 30 more fields besides the two fields, ID & CODE) I show here.
>
> ID is one of the fields in this data set. It stands for a patient ID.
> It is a character field of length 1.
>
> CODE is another 5-digit character field. It stands for a medical procedure.
> This field takes on thousands of values. There is, however, a subset of these
> values I am interested in. This subset contains the CODE values
>
> NULL 59510 J2505 59400 J1745 J3490 27447
> A0431 FACIL J9355 J9310 E0784
>
> Here is sample dataset containing 34 records.
>
> ID CODE
> 1 J3490
> 1 NULL
> 1 36415
> 1 88142
> 1 59510
> 2 99213
> 2 80053
> 2 J2505
> 3 FACIL
> 3 J3490
> 4 99232
> 5 80053
> 5 E0784
> 5 E0784
> 5 J9310
> 6 A0431
> 7 59400
> 7 59400
> 8 NULL
> 8 NULL
> 8 NULL
> 8 NULL
> 9 27447
> 9 J9355
> 10 E0784
> 10 NULL
> 10 E0784
> 10 NULL
> 10 J2505
> 10 59510
> 10 NULL
> 10 NULL
> 10 E0784
> 10 59510
>
>
> I would like to create additional fields to augment the above data set for modeling.
>
>
> *****************************************************************************************************
> I would like to say in words what I am trying to compute.
>
> Maybe you can find an easier way to get the results I show below.
>
> For each patient (ID) in my sample dataset, create new data columnns to compute
> the total number of medical services (CODE) they had within any of the top medical services (CODEs)
> shown below:
>
> NULL 59510 J2505 59400 J1745 J3490 27447
> A0431 FACIL J9355 J9310 E0784
>
> The above CODEs are the medical procedures, out of many thousands of medical procedures,
> I have determined to be of interest to me.
>
> Therefore I would like to know, given the thousands of procedures, and given my 12 proceduers shwon above,
> the total number of procedures (CODEs) a patientb (ID) had within any of the 12 proceduers shown above.
>
> I would like to create counts and percentages.
> *****************************************************************************************************
>
>
>
> (1) I would like to know, for each ID, how many CODEs are found in
> the subset of codes mentioned above (I repeat them here):
> NULL 59510 J2505 59400 J1745 J3490 27447
> A0431 FACIL J9355 J9310 E0784
>
> For example, create a new field, call it, say, SUM_COUNT, to count
> the total number of codes per ID:
>
> ID CODE SUM_COUNT
> 1 J3490 1
> 1 NULL 1
> 1 36415 0
> 1 88142 0
> 1 59510 1
> 2 99213 0
> 2 80053 0
> 2 J2505 1
> 3 FACIL 1
> 3 J3490 1
> 4 99232 0
> 5 80053 0
> 5 E0784 2
> 5 E0784 2
> 5 J9310 1
> 6 A0431 1
> 7 59400 2
> 7 59400 2
> 8 NULL 4
> 8 NULL 4
> 8 NULL 4
> 8 NULL 4
> 9 27447 1
> 9 J9355 1
> 10 E0784 3
> 10 NULL 4
> 10 E0784 3
> 10 NULL 4
> 10 J2505 1
> 10 59510 2
> 10 NULL 4
> 10 NULL 4
> 10 E0784 3
> 10 59510 2
>
> (2) Now that I have counts, I would like to compute WITHIN ID percentages.
>
> For example, within ID=10, we have 10 records. Of those 10 records, we have:
>
> NULL=4 records or 4/10=40%
> E0784=3 records or 3/10=30%
> J2505=1 records or 1/10=10%
> 59510=2 records or 2/10=20%
>
> My data set now will look like:
>
> ID CODE SUM_COUNT WITHIN_PCT
> 1 J3490 1 0.20
> 1 NULL 1 0.20
> 1 36415 0 0.00
> 1 88142 0 0.00
> 1 59510 1 0.20
>
> 2 99213 0 0.00
> 2 80053 0 0.00
> 2 J2505 1 0.33
>
> 3 FACIL 1 0.50
> 3 J3490 1 0.50
>
> 4 99232 0 0.00
>
> 5 80053 0 0.00
> 5 E0784 2 0.50
> 5 E0784 2 0.50
> 5 J9310 1 0.25
>
> 6 A0431 1 1.00
>
> 7 59400 2 1.00
> 7 59400 2 1.00
>
> 8 NULL 4 1.00
> 8 NULL 4 1.00
> 8 NULL 4 1.00
> 8 NULL 4 1.00
>
> 9 27447 1 0.50
> 9 J9355 1 0.50
>
> 10 E0784 3 0.30
> 10 NULL 4 0.40
> 10 E0784 3 0.30
> 10 NULL 4 0.40
> 10 J2505 1 0.10
> 10 59510 2 0.20
> 10 NULL 4 0.40
> 10 NULL 4 0.40
> 10 E0784 3 0.30
> 10 59510 2 0.20
>
>
> (3) Finally, I woul like to have an OVERALL percentage.
>
> For example, lets take ID=1. (We have a TOTAL of 34 records.)
>
> I would like a new field, call it, OVERALL_PCT.
>
> For ID=1 and CODE='J3490' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
> For ID=1 and CODE='NULL' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
> For ID=1 and CODE='36415' we will have SUM_COUNT/34 = 0/34 = 0.00000 (or 0.00%).
> For ID=1 and CODE='88142' we will have SUM_COUNT/34 = 0/34 = 0.00000 (or 0.0%).
> For ID=1 and CODE='59510' we will have SUM_COUNT/34 = 1/34 = 0.02941 (or 2.94%).
>
> ID CODE SUM_COUNT OVERALL_PCT
> 1 J3490 1 0.02941
> 1 NULL 1 0.02941
> 1 36415 0 0.00000
> 1 88142 0 0.00000
> 1 59510 1 0.02941
>
>
> Therefore, my final desired data set will look like:
>
> ID CODE SUM_COUNT WITHIN_PCT OVERALL_PCT
> 1 J3490 1 0.20 0.02941
> 1 NULL 1 0.20 0.02941
> 1 36415 0 0.00 0.00000
> 1 88142 0 0.00 0.00000
> 1 59510 1 0.20 0.02941
> 2 99213 0 0.00 0.00000
> 2 80053 0 0.00 0.00000
> 2 J2505 1 0.33 0.02941
> 3 FACIL 1 0.50 0.02941
> 3 J3490 1 0.50 0.02941
> 4 99232 0 0.00 0.00000
> 5 80053 0 0.00 0.00000
> 5 E0784 2 0.50 0.05882
> 5 E0784 2 0.50 0.05882
> 5 J9310 1 0.25 0.02941
> 6 A0431 1 1.00 0.02941
> 7 59400 2 1.00 0.05882
> 7 59400 2 1.00 0.05882
> 8 NULL 4 1.00 0.12500
> 8 NULL 4 1.00 0.12500
> 8 NULL 4 1.00 0.12500
> 8 NULL 4 1.00 0.12500
> 9 27447 1 0.50 0.02941
> 9 J9355 1 0.50 0.02941
> 10 E0784 3 0.30 0.08823
> 10 NULL 4 0.40 0.11765
> 10 E0784 3 0.30 0.08823
> 10 NULL 4 0.40 0.11765
> 10 J2505 1 0.10 0.02941
> 10 59510 2 0.20 0.05882
> 10 NULL 4 0.40 0.11765
> 10 NULL 4 0.40 0.11765
> 10 E0784 3 0.30 0.08823
> 10 59510 2 0.20 0.05882
>
> Thank so much!!!
> tom
>
> --
> We've Got Your Name at http://www.mail.com !
> Get a FREE E-mail Account Today - Choose From 100+ Domains
>
|