Date: Fri, 7 Feb 2003 11:13:32 -0500
Reply-To: Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject: Re: Merging datasets ...
Content-Type: text/plain; charset="iso-8859-1"
Roger,
Everything you say looks good except the question. Perhaps I missed
something.
IanWhitlock@westat.com
-----Original Message-----
From: Roger Lustig [mailto:trovato@BELLATLANTIC.NET]
Sent: Thursday, February 06, 2003 5:16 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Merging datasets ...
Ian:
You're right: the solutions you give are limited by time. Don't really
know whether the data step is faster; since you're not sorting first,
you have to do about N*K/2 lookups, which can be a lot. (K is the
number of rows in the lookup table.) Or am I missing something?
If you sort both files first, you can then optimize the data step by
starting your lookups for a given w record with the lkup record that
provided a range match most recently. That way, you do at most 1 read
for each lkup record. Depending on how much the I/O costs for one read
from the lookup data set, the unsorted approach may be pretty expensive.
After all, log2(500,000) is somewhere short of 19. However expensive
each step of a search through a B+ tree may be, it's not 13,000
(=250,000/19ish) times more expensive than a read from the lkup file.
Roger
Ian Whitlock wrote:
> Johny,
>
> Here are two fast to code solutions.
>
> data w ; input number ; cards ;
> 23
> 24
> 435
> 45
> ;
>
> data lkup ; input id from to ; cards ;
> 1 1 3
> 2 4 23
> 3 26 500
> ;
>
> proc sql ;
> create table q as
> select *
> from w left join lkup
> on number between from and to
> order by number
> ;
> quit ;
>
> data q ;
> set w ;
> do pt = 1 to nobs until ( flag ) ;
> set lkup point = pt nobs = nobs ;
> if from <= number <= to then flag = 1 ;
> else
> if number < from then flag = -1 ;
> end ;
> if flag = -1 then id = . ;
> run ;
>
> The DATA step will be somewhat faster than the SQL. These solutions are
> primarily limited by time.
>
> If efficiency really matters and the number of records in LKUP are in the
> small millions or less, then there will be an array solution which will
be
> more efficient than the above.
>
> Roger's solution with a format will also be good but then LKUP should
> probably be under 500,000+. The exact limits on the array and format
> solutions will depend on how much memory you have. My numbers for
> relatively modest PC's. My guess is that these solutions are not worth
the
> effort unless you are learning, the limits are important, or you have to
run
> many times.
>
> IanWhitlock@westat.com
> -----Original Message-----
> From: Johny Bravo [mailto:manu_killer@POCZTA.ONET.PL]
> Sent: Thursday, February 06, 2003 2:25 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Merging datasets ...
>
>
> Hi !
>
> I have a following problem.
> I have two datasets.
> First dataset contains numbers, e.g.:
> number
> 23
> 24
> 435
> 45
> ...
> And second dataset that contains intervals, e.g.:
> id from to
> 1 1 3
> 2 4 23
> 3 26 500
> ... .... ....
>
> I want to classify all numbers from first dataset
> to intervals in the second. I want to have on output
> dataset that contains number and an ID of particular
> interval, e.g.:
> number ID
> 23 2
> 24 null
> 435 3
> 45 3
> ... ...
> ( BTW intervals in second set are separate)
>
> I am wondering how to do it with data step and maybe "retain".
> Maybe If-clause is nessesary, i don't know :(
> The main problem is how to join datasets with diffrent number
> of observations.
>
> Thanks in advance for advice,
> Best regards,
> Johny Bravo ;)
|