Date: Wed, 10 Dec 2008 09:24:38 -0500
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>"
<schreier.junk.mail@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>"
<schreier.junk.mail@GMAIL.COM>
Subject: Re: CORRECTION: PROC SQL - select records according to FORMAT
value rather than actual value
On Wed, 10 Dec 2008 13:02:37 +1300, Adrian Slack <Adrian.Slack@VUW.AC.NZ> wrote:
>Sorry, I pasted the example code over with errors. Corrected example
>below.
>
>DATA temp00;
> INPUT id diag01 $ ecode01 $ wies ;
> CARDS ;
> 1 C000 . 0.7129
> 2 C000 W1009 0.4105
> 3 C000 Y838 6.9079
> 4 S010 . 3.0588
> 5 S010 W2208 0.2986
> 6 S010 Y870 2.4198
>;
>
>DATA temp01;
> INPUT ecode01 $;
> CARDS ;
> W1009
> W2208
>;
>
>DATA temp02 (RENAME= (ecode01=start));
> SET temp01;
> fmtname='$injury';
> label = 1;
>RUN;
>
>PROC FORMAT cntlin=temp02;
>QUIT;
>
>DATA temp03;
>SET temp01;
>FORMAT ecode01 $injury.;
>RUN;
>
>PROC SQL;
>CREATE TABLE temp04 AS
>SELECT *
>FROM temp03
>WHERE ecode01 = '1';
>QUIT;
>
>PROC PRINT DATA= temp04; QUIT;
>
>> ______________________________________________
>> From: Adrian Slack
>> Sent: Wednesday, 10 December 2008 12:53 p.m.
>> To: 'SAS-L@LISTSERV.UGA.EDU'
>> Subject: PROC SQL - select records according to FORMAT value
>> rather than actual value
>>
>> Hi
>>
>> Question:
>> How do I use a WHERE statement in PROC SQL to select a record
>> according to a variable's formatted value rather than its actual
>> value?
>>
>> E.g.:
>> Suppose X1 = C000 and applying a (character) format makes X1 = '1'.
>> Here, 1 is treated as a character rather than a number (reasoning
>> below). A straight WHERE X = '1' statement in PROC SQL doesn't pick up
>> this record, because the actual value is 'C000' not '1'.
>>
>> Background:
>> I have a large patient discharge dataset. I want to select specific
>> records if their primary diagnosis (DIAG01) or their external injury
>> (ECODE01) match the list I'm interested in, so I can ultimately sum a
>> cost variable (WIES). A record may have both a DIAG01 and ECODE01
>> that I am interested in, but I only want to count each record once.
>> The ECODE01 takes priority.
>>
>> So, I want to split the dataset into two, and analyse the resulting
>> sets separately. That is, I want a set with matched ECODE01s and
>> residual set with matching DIAG01s (the remaining, unmatched records
>> are irrelevant and I don't want to keep them in the analytical
>> datasets).
>>
>> I've managed to do this by a torturous PROC SQL where I list all the
>> ECODE01s I want to keep. But this is hard coded in with a statemetn
>> WHERE ECODE01 = 'xxxx' OR ECODE01 = 'xxxx' OR ...
>> But I want to be able to easily track what's on the list, and add or
>> remove conditions by changing an input file rather than my code.
>>
>> Partial solution:
>> I've just thought, instead, of using PROC FORMAT, as follows using a
>> dataset (TEMP01), which is a list of all the ECODE01s I want to match,
>> to create a format ($INJURY), applying this to my dataset (TEMP00),
>> and then using PROC SQL to create two datasets with the appropriate
>> records (matched ECODE01s and matched DIAG01s; I haven't done the
>> latter step for matching DIAG01s yet).
>>
>> data temp00;
>> input ID DIAG01 $ ECODE01 $ WIES ;
>> cards ;
>> 1 C000 . 0.7129 /* Keep because of
>> relevant DIAG01 */
>> 2 C000 W1009 0.4105 /* Keep because of
>> relevant ECODE01; overrides relevant DIAG01 */
>> 3 C000 Y838 6.9079 /* Keep because of
>> relevant DIAG01 despite irrelevant ECODE01 */
>> 4 S010 . 3.0588 /* Discard, nothing
>> relevant */
>> 5 S010 W2208 0.2986 /* Keep because of
>> relevant ECODE01 */
>> 6 S010 Y870 2.4198 /* Discard, nothing
>> relevant */
>> ;
>>
>> data temp01;
>> input ECODE01 $;
>> cards ;
>> W1009
>> W2208
>> ;
>>
>> DATA temp02 (rename= (ecode01=start));
>> set temp01;
>> fmtname='$injury'; /* Defines a character
>> format, so I can apply it to an existing character varialbe
>> label = 1; /* This comes out as a
>> character variable; I don't want to use a letter. (Because my format
>> is LENGTH 1, it only picks up the first element of the variable to
>> which I apply the format (and which always starts with a letter.) */
>> RUN;
>>
>> PROC FORMAT cntlin=temp02;
>> QUIT;
>>
>> DATA temp03;
>> SET temp02;
>> FROMAT ecode01 $injury.;
>> RUN;
>>
>> PROC SQL;
>> CRATE TABLE temp04 AS
>> SELECT *
>> FROM temp03
>> WHERE ecode01 = '1';
>> QUIT;
>>
>> PROC PRINT DATA= temp04; QUIT;
>>
>> NOTE: No observations in data set WORK.TEMP04.
>> NOTE: The PROCEDURE PRINT used 0.0 seconds.
>>
>> Problem:
>> There are no records where ecode01 = '1', only (some) records where
>> the format of ecode01 = '1', i.e. records 2 and 5. Can I modify the
>> WHERE statement so it targets the formatted value rather than the
>> actual value?
>>
>> Thanks in advance.
>>
>> Adrian.
>>
>> ----------------------------------------------------------------------
>> ----------------------------------------
>> School of Economics and Finance
>> Victoria University of Wellington
>>
>> P +64 4 463 5233 xtn 8571
>> E adrian.slack@vuw.ac.nz
>> A Room 315, Rutherford House, PO Box 600, Wellington, NEW ZEALAND
>> ----------------------------------------------------------------------
>> ----------------------------------------
>>
I think it is a very good idea to keep your lists of "interesting" codes in
tables, rather than embedding them into code.
I would not create two separate output files. Rather, have one file with a
categorical variable to distinguish the subsets.
SQL can probably make it all rather simple, eliminating need for formats
altogether. Instead, use subqueries to do the table lookups. A CASE
structure can impose the hierarchy and avoid double counting. The
aggregation can even be done in the same statement. The code would be
something like:
data have;
input ID DIAG01 $ ECODE01 $ WIES ;
cards ;
1 C000 . 0.7129
2 C000 W1009 0.4105
3 C000 Y838 6.9079
4 S010 . 3.0588
5 S010 W2208 0.2986
6 S010 Y870 2.4198
;
data ecodes;
input keeper $;
cards ;
W1009
W2208
;
data diags;
input keeper $;
cards ;
C000
;
proc sql;
create table need as
select
case when ecode01 in (select keeper from ecodes) then 'ecode'
when diag01 in (select keeper from diags ) then 'diag'
else 'no'
end as Category
, sum(wies) as Wies
from have
group by category;
quit;
Result:
Category Wies
diag 7.6208
ecode 0.7091
no 5.4786
|