| Date: | Tue, 2 Dec 2003 14:16:05 -0500 |
| Reply-To: | Ian Whitlock <WHITLOI1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Ian Whitlock <WHITLOI1@WESTAT.COM> |
| Subject: | Re: Calculate highest Mode for character variables |
|
| Content-Type: | text/plain |
|---|
Pankaj,
You will have to write the code for such a specialized problem. here is a
model using your sample data.
data w ; input
key1 $ key2 $ c1 $ c2 $ ;
cards ;
123 abc TB ban
123 abc TB ban
123 abc TB can
123 abc TB ban
123 abc PS ban
123 abc PS ban
123 abc KT ban
123 abc KT man
124 abd TM zan
125 abe TN lan
125 abe TN lan
125 abe TN van
125 abe PS van
125 abe PS van
126 adf NE pan
126 adf KT pan
;
proc freq data = w ;
table key1 * key2 / missing noprint out = fkey ;
table key1 * key2 * c1 / missing noprint out = fc1 ;
table key1 * key2 * c2 / missing noprint out = fc2 ;
run ;
proc sort data = fc1 ;
by key1 key2 descending count c1 ;
run ;
proc sort data = fc2 ;
by key1 key2 descending count c2 ;
run ;
data q ;
merge fkey ( rename = ( count = total ) drop = percent )
fc1 ( rename = ( count = c1_high_mode ) drop = percent )
fc2 ( rename = ( count = c2_high_mode ) drop = percent )
;
by key1 key2 ;
if first.key2 ;
c1_percent = c1_high_mode / total ;
c2_percent = c2_high_mode / total ;
run ;
The code could be easily macroized to handle 15 C-variables. If you need
help then ask.
The 80 million records may cause more trouble. It is quite likely that the
number of categories is too many for one PROC FREQ to handle. PROC SUMMARY
is probably better but still may need to be broken down into several runs.
IanWhitlock@westat.com
-----Original Message-----
From: Pankaj Singal [mailto:pankaj.singal@WELLSFARGO.COM]
Sent: Tuesday, December 02, 2003 12:17 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Calculate highest Mode for character variables
Hi All:
We are having a problem to calculate modes for character variables,
This is what we want
We need to calculate the mode for a number of variables and keep the highest
frequency. Also, if there is more than one value with the same frequency, we
just want to keep one of the values of mode.
e.g.
Here is my data (originally we have 80 million records with 15
columns)
key1 key2 c1 c2
123 abc TB ban
123 abc TB ban
123 abc TB can
123 abc TB ban
123 abc PS ban
123 abc PS ban
123 abc KT ban
123 abc KT man
124 abd TM zan
125 abe TN lan
125 abe TN lan
125 abe TN van
125 abe PS van
125 abe PS van
126 adf NE pan
126 adf KT pan
This is what we need finally
by key1 key2,
key1 key2 c1 c1_percent c1_high_mode c2 c2_percent
c2_high_mode
123 abc TB .50 4 ban .75 6
124 abd TM 1.0 1 zan 1.0 1
125 abe TN .6 3 van .6 3
126 adf NE .5 1 pan 1.0 2
Could anyone help suggest a SAS utility/procedure to generate this kind of
output.
Any help would be highly appreciated.
Pankaj Singal
|