Date: Thu, 7 Jun 2001 14:39:59 GMT
Reply-To: Huck <huck@SKIPTHISFINN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Huck <huck@SKIPTHISFINN.COM>
Organization: AT&T Worldnet
Subject: Re: Randomly pick
i am playing catchup, after having skipped reading for a while,
and yes this is an OLD topic,
and no i dont believe it was described by sas BEFORE sas-l existed,
but ill accept it was described by them before it was described HERE.
one of the former times this came up, i posted the k/n code,
mentioned how i had it explained at a SUGI, and while i could
understand it, i sure wasnt going to try to post what i belive is the
proof quoted below.
and i believe it was Dr John who came back and said
"musta been me"... yea i think so too.
ok, at that sugi, he also gave a shorter "proof",
it kinda went like this,
imagine we created all possible samples K of universe N,
and picked one at random, is that a random sample?
and he did a like 5 line "proof" and showed
"well see how this generates all samples and we pick one."
i belive it involved giving an "order count" to the possible samples
and somehow the sequence of the random #'s provided
the number of the sample to pick as our random sample.
(if i knew i was gonna repeat this so many times
i would have taken notes)
that description convinced me
(tho i realy needed it as i was already using it.......)
am i right dr john, was that you and do you remember that one?
it wasnt as "deep" as this one.........
and it never came out in this discussion that i can see
Tom
ps and dr john pointed out then how to do it without doing the set for
all obs, but note that when reading certain types of datasets
like tapes, your kinda stuck with it anyway, so i made two macros
one for where i could use point, and one for where i couldnt
"only your programmer knows for sure "
On 11 May 01 01:22:07 GMT, John.W@MEDISCIENCE.CO.UK (John Whittington)
wrote:
>At 17:42 10/05/01 -0500, Jonathan Goldberg wrote (in part):
>
>>It's been too long since my probability theory days for me to give an
>>formal derivation.
>>In general, the K/N method works because shifts in ongoing selection
>>probability are random. If we pick more at first, we pick fewer later,
>>and vice versa. Either is equally likely to happen. All records have the
>>same a priori selection probability, as required.
>
>Like many people, when I first came across the 'k/n algorithm' (which
>appeared to be highly couter-intuitive), I simply did not believe it could
>work - but a bit of playing with probability theory eventually convinced me.
>
>I think it may be an appropriate time to re-post the 'proof' which I once
>painstaking typed out in ASCII a while back whene this discussion was going
>on here on SAS-L, as follows. The exercise is to randomly select k from a
>population of n (the following should ideally be viewed using a
>NON-proprtionally-specaed font) ......
>
>... The definition of 'random' obviously means that the probability of
>selection of *any* of the elements has to be the same, and equal to k/n.
>
>Consider the very first item considered. There is a k/n probability of it
>being selected into the sample. When it comes to the next item, n will
>*always* have been decremented by the algorithm to (n-1) and there are two
>possibilities as regards k:
>
>a)... there is a k/n probability that the first item was selected, in which
>case the algorithm will have decremented k to (k-1). In this case, the
>probability of the second item being selected during this second step is
>therefore ((k-1)/(n-1)) and thus the overall probability of this having
>happened is thus:
>
> (k/n) * ((k-1)/(n-1)) .... (1)
>
>b)... there is a (1 -(k/n)) probability that the first item was *not*
>selected, in which case the algorithm will *not* have decremented k. In
>this case, the probability of the second item being selected during this
>second step is therefore (k/(n-1)) and thus the overall probability of this
>having happened is thus:
>
> (1-(k/n)) * (k/(n-1)) .... (2)
>
>Adding (1) and (2) we get the total probability of the second item being
>selected into the sample, namely:
>
> [ (k/n) * ((k-1)/(n-1)) ] + [ (1-(k/n)) * (k/(n-1)) ]
>
> ( k ) ( k-1 ) ( k )( k )
> = ( - ) (-----) + ( 1 - - )(-----)
> ( n ) ( n-1 ) ( n )( n-1 )
>
> ( k ) ( k-1 ) ( n - k )( k )
> = ( - ) (-----) + ( ----- )(-----)
> ( n ) ( n-1 ) ( n )( n-1 )
>
> ( k-1 ) ( k )
> = k (------ ) + (n-k)( ------ )
> ( n(n-1) ) ( n(n-1) )
>
> k(k-1) + k(n-k)
> = ---------------
> n(n-1)
>
> k^2 - k + kn - k^2
> = ------------------
> n(n-1)
>
> k(n-1)
> = ------
> n(n-1)
>
> k
> = ---
> n
>
>Hence the probability of the second item, using this algorithm is also k/n.
> The same argument can obviously used for any stage during the selection
>process, so that the probability of *any* particular item beings elected is
>k/n - hence, random.
>
>I hope I'm explained that clearly enough - algebra via ASCII is not too
>clever!
>
>Kind Regards,
>
>
>John
>
>----------------------------------------------------------------
>Dr John Whittington, Voice: +44 (0) 1296 730225
>Mediscience Services Fax: +44 (0) 1296 738893
>Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk
>Buckingham MK18 4EL, UK mediscience@compuserve.com
>----------------------------------------------------------------
|