Date: Mon, 27 Oct 2008 15:21:57 -0400
Reply-To: Ya Huang <ya.huang@AMYLIN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ya Huang <ya.huang@AMYLIN.COM>
Subject: Re: Finding nearest neighbors
The ORDINAL() function should be a natural choice for this. Unfortunately,
ORDINAL handles missing values differently from the other statistical
functions: it counts them, and a missing value sorts before any number.
This makes things a bit tricky. To force ORDINAL to 'ignore' the missing
values, we can change them to a very large number so that they are
ordered last. To make it easier to reference all the variables, we should
also add a prefix in PROC DISTANCE. For now, I assume all the variables
start with _0, so I can refer to them as _0:.
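
To see the difference on a tiny list, here is a small sketch. It is not
part of the solution below. It assumes SAS 9 or later, where the SMALLEST
function is available as a counterpart that skips missing values.

```sas
data _null_;
  /* ORDINAL counts the missing value, and missing sorts lowest. */
  o1 = ordinal(1, ., 3, 1);   /* missing */
  o2 = ordinal(2, ., 3, 1);   /* 1 */
  /* SMALLEST only ranks the nonmissing values. */
  s1 = smallest(1, ., 3, 1);  /* 1 */
  s2 = smallest(2, ., 3, 1);  /* 3 */
  put o1= o2= s1= s2=;
run;
```

So with ORDINAL, every missing distance in a row shifts the position of
the real distances, which is why the missing values are recoded first.
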
data xx;
input core $ _00614D _08185C _08394B _08545A _08558A _08562A;
cards;
00614D 0.0000 . . . . .
08185C 10.1968 0.0000 . . . .
08394B 6.7983 12.4716 0.0000 . . .
08545A 13.0307 4.5231 16.1410 0.0000 . .
08558A 7.1568 8.9766 8.1237 12.4371 0.0000 .
08562A 7.1285 4.3278 9.7992 7.3852 6.5285 0.0000
;
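data yy;
  set xx;
  array d _0:;
  /* Recode missing distances so that they sort last. */
  do over d;
    if missing(d) then d=99999;
  end;
  /* ordinal(1, ...) is the 0 on the diagonal, so start at 2. */
  near1=ordinal(2,of _0:);
  near2=ordinal(3,of _0:);
  near3=ordinal(4,of _0:);
  /* Turn the placeholder back into a missing value. */
  if near1=99999 then near1=.;
  if near2=99999 then near2=.;
  if near3=99999 then near3=.;
run;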
proc print data=yy noobs;
  var core near1 near2 near3;
run;
core        near1      near2      near3
00614D        .          .          .
08185C     10.1968       .          .
08394B      6.7983    12.4716       .
08545A      4.5231    13.0307    16.1410
08558A      7.1568     8.1237     8.9766
08562A      4.3278     6.5285     7.1285
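
One limitation shows up in the listing above: the matrix is lower
triangular, so each row only sees the cores listed before it, and 00614D
ends up with no neighbors at all. A possible way around this, offered only
as a sketch, is to ask PROC DISTANCE for the full square matrix
(SHAPE=SQUARE) and a common column prefix (PREFIX=). The data set toy, its
variables x and y, the prefix d_, and the output names near1-near3 and
nid1-nid3 are all made up for illustration. I also assume SAS 9.2 for
WHICHN, and that the columns come out in the same order as the rows, so
that column _N_ holds the self-distance.

```sas
/* Hypothetical toy input: two measurements per core (values invented). */
data toy;
  input core $ x y;
  cards;
A1 0 0
A2 3 4
A3 6 8
A4 1 1
;

/* SHAPE=SQUARE writes both triangles; PREFIX= gives every distance
   column the same stem, so d_: matches all of them. */
proc distance data=toy out=dist method=euclid shape=square prefix=d_;
  var interval(x y);
  id core;
run;

data nearest;
  set dist;
  array d{*} d_:;
  array near{3} near1-near3;
  array nid{3} $32 nid1-nid3;
  /* Assumption: row i and column i refer to the same core. */
  d{_n_} = .;
  do k = 1 to dim(near);
    /* MIN ignores missing values, so no placeholder is needed. */
    near{k} = min(of d{*});
    if not missing(near{k}) then do;
      j = whichn(near{k}, of d{*});  /* first column holding that distance */
      nid{k} = vname(d{j});          /* column name of the neighbor */
      d{j} = .;                      /* take it out of the next pass */
    end;
  end;
  keep core near1-near3 nid1-nid3;
run;

proc print data=nearest noobs;
run;
```

nid1-nid3 hold the column names (for example d_A4), so the prefix still
has to be stripped if you want the bare core IDs. Blanking out each
neighbor once it is found also keeps tied distances from picking the same
column twice.
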
On Mon, 27 Oct 2008 14:00:47 -0400, Peter Flom
<peterflomconsulting@MINDSPRING.COM> wrote:
>Hello
>
>I'd like to find the nearest neighbors for each of a large number of
>subjects in a multivariate space.
>
>I found PROC DISTANCE and used it as follows
>
><<<
>proc distance data = allout out = distance;
> var interval(MAFAnT BAFAnT BRFAnT mif2f7a BAF1F7T
> BAF2F8T BAF2F7T BAF1F7S BAF2F8S BAF7F8S
> BFrF17S BFrF8ZS BRFAnB CoF7FZD MAF7T) ;
> id core;
>run;
>>>>>
>
>Now, the data set "distance" has 695 variables and 694 observations. It's
>a large, lower triangular matrix of distances. The rows are CORE numbers
>and the columns are the same CORE numbers, preceded by a _
>
>e.g.
>
>Obs  core    _00614D  _08185C  _08394B  _08545A  _08558A  _08562A  _08564A  _08580A  _08581A
>
>  1  00614D   0.0000    .        .        .        .        .        .        .        .
>  2  08185C  10.1968   0.0000    .        .        .        .        .        .        .
>  3  08394B   6.7983  12.4716   0.0000    .        .        .        .        .        .
>  4  08545A  13.0307   4.5231  16.1410   0.0000    .        .        .        .        .
>  5  08558A   7.1568   8.9766   8.1237  12.4371   0.0000    .        .        .        .
>  6  08562A   7.1285   4.3278   9.7992   7.3852   6.5285   0.0000    .        .        .
>
>
>with many more rows and columns.
>
>Now, I'd like to get a data set with 695 observations, and (say) 3
>variables NEAREST1 NEAREST2 and NEAREST3.
>
>I figure this must have been done before, but I didn't find it ....
>
>
>Thanks
>
>Peter
>
>Peter L. Flom, PhD
>Statistical Consultant
>www DOT peterflom DOT com