Date: Sun, 13 May 2007 22:22:48 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: how to calculate the median by groups
In-Reply-To: <200705131824.l4DAkdgw013722@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed
Howard expertly essayed:
>
>On Mon, 7 May 2007 22:26:07 -0700, David L Cassell <davidlcassell@MSN.COM>
>wrote:
>
> >Huang.JS@PRINCIPAL.COM replied:
>
>[snip]
>
> >
> >This works.
> >
> >BUT it does not scale well, and it is about the least efficient way to
> >compute medians. Okay, second-least efficient: there's a way to do
> >it in SQL that looks a lot less efficient.
> >
> >SAS has methods of computing medians (and other quantiles) that are
> >state-of-the-art, and can even handle *huge* data sets. I recommend
> >using the SAS procs designed for these types of processes.
> >
> >I'm still waiting to find out why Austina wants medians, and why she
>wants
> >medians in SQL...
> >
> >HTCT,
> >David
> >--
> >David L. Cassell
> >mathematical statistician
> >Design Pathways
> >3115 NW Norwood Pl.
> >Corvallis OR 97330
>
>I agree with David.
>
>But perhaps medians in SQL are are worth exploring out of curiosity. Celko
>has 8 different ways (SQL for Smarties, 2nd ed., pp. 353-365). I think it's
>safe to say that they are all convoluted and all inefficient.
>
>So I set out to develop my own convoluted and inefficient PROC SQL code to
>compute medians.
>
>Test data:
>
> data test;
> input groupID $ measure;
> cards;
> a 1.1
> a 2.1
> a 3.1
> a 4.1
> b 1.1
> b 2.1
> b 3.1
> c 101
> d .
> d .
> ;
>
>The first task is to get the half-populations:
>
> proc sql;
>
> create table halfpoints as
> select groupid, floor(count(measure) / 2) as halfpoint
> from test
> group by groupid;
>
>Result:
>
> groupID halfpoint
> -------------------
> a 2
> b 1
> c 0
> d 0
>
>Next, get the cumulative frequencies, in both directions:
>
> create view first as
> select measures.groupid
> , measures.measure
> , sum(measures.measure > test.measure) as lohi
> , sum(measures.measure < test.measure) as hilo
> from (select distinct groupid, measure from test) as measures
> join
> test
> on measures.groupid=test.groupid
> group by measures.groupid, measures.measure;
>
>Result:
>
> groupID measure lohi hilo
> --------------------------------------
> a 1.1 0 3
> a 2.1 1 2
> a 3.1 2 1
> a 4.1 3 0
> b 1.1 0 2
> b 2.1 1 1
> b 3.1 2 0
> c 101 0 0
> d . 0 0
>
>Now derive what might be called "candidate medians":
>
> create view second as
> select groupid
> , case when lohi <= (select halfpoint
> from halfpoints
> where groupid=first.groupid)
> then measure
> else .
> end as lohi,
> case when hilo <= (select halfpoint
> from halfpoints
> where groupid=first.groupid)
> then measure
> else .
> end as hilo
> from first;
>
>Result:
>
> groupID lohi hilo
> ----------------------------
> a 1.1 .
> a 2.1 2.1
> a 3.1 3.1
> a . 4.1
> b 1.1 .
> b 2.1 2.1
> b . 3.1
> c 101 101
> d . .
>
>Finally, pick the extreme values and average them:
>
> create table medians as
> select groupid, mean(max(lohi),min(hilo) ) as Median
> from second
> group by groupid;
>
>Result:
>
> groupID Median
> ------------------
> a 2.6
> b 2.1
> c 101
> d .
>
>Or, get the same answers by rolling all of the views into one big
>statement:
>
> create table medians as
> select groupid
> , mean( max(case when lohi <= (select halfpoint
> from halfpoints
> where groupid=outer.groupid)
> then measure
> else .
> end)
> , min(case when hilo <= (select halfpoint
> from halfpoints
> where groupid=outer.groupid)
> then measure
> else .
> end)
> ) as median
> from (select measures.groupid
> , measures.measure
> , sum(measures.measure > test.measure) as lohi
> , sum(measures.measure < test.measure) as hilo
> from (select distinct groupid, measure from test) as measures
> join
> test
> on measures.groupid=test.groupid
> group by measures.groupid, measures.measure
> ) as outer
> group by groupid
> ;
>
> quit;
Yowsers!
Can you do a weighted median too? I spent some time last summer
trying to do that in PROC SQL, without enough satisfactory results
to use it in production code.
Your HALFPOINT variable would have to be
> proc sql;
>
> create table halfpoints as
> select groupid, sum(SamplingWeight)/2 as halfpoint
> from test
> group by groupid;
And you would still have to check whether the halfpoint falls
right in between two distinct values, evne if that is now a
much smaller likelihood.
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
_________________________________________________________________
More photos, more messages, more storage—get 2GB with Windows Live Hotmail.
http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_2G_0507
|