LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2003, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 10 Jul 2003 04:41:18 +0000
Reply-To:   sashole@bellsouth.net
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject:   Re: Cut a part out of a big flat fixed length text file
Comments:   To: sophe88@YAHOO.COM
Content-Type:   text/plain; format=flowed

PD,

Currently having no access to the real computer (alas!), I can at least try this under XP. Yes, my failure to supply END=EOF option to the SET statement results in the warning, yet because of the nature of the Data step, it does not affect the outcome (of course, it does not mean it should not be coded). Faking the input with (irrelevant notes dropped)

1 filename large "h:\sastest.txt" ; 2 data acctlist ; 3 file large ; 4 do _n_ = 1 to 15 ; 5 put @01 _n_ z15. @17 _n_ 15. ; 6 hin = put (_n_ , z15.) ; 7 if ranuni(1) < .5 then output ; 8 end ; 9 run ;

NOTE: 15 records were written to the file LARGE. NOTE: The data set WORK.ACCTLIST has 6 observations and 1 variables.

the rest proceeds as expected:

10 data cntlin / view = cntlin ; 11 retain fmtname 'search' type 'c' label '+' ; 12 do until ( eof ) ; 13 set acctlist ( rename = (hin=start) ) ; 14 output ; 15 end ; 16 hlo = 'O' ; *this is letter O, not zero ; 17 label = '-' ; 18 output ; 19 stop ; 20 run ;

NOTE: DATA STEP view saved on file WORK.CNTLIN.

21 proc format cntlin = cntlin ; NOTE: Variable eof is uninitialized. NOTE: Format $SEARCH has been output. 22 run ;

NOTE: There were 6 observations read from the data set WORK.ACCTLIST. NOTE: There were 6 observations read from the data set WORK.CNTLIN.

23 data _null_ ; 24 infile large ; 25 file log ; *replace with your outfile as needed ; 26 input @ 001 hin $char15. @ ; 27 if put (hin, $search.) = '+' then put _infile_ ; 28 run ;

000000000000001 1 000000000000003 3 000000000000004 4 000000000000009 9 000000000000010 10 000000000000014 14 NOTE: 15 records were read from the infile LARGE.

Now as you have the purely digit string key, the hash solution I have suggested earlier comes to the same result:

166 %let h = 300001 ; 167 data _null_ ; 168 array hk ( 0 : &h ) _temporary_ ; 169 if _n_ = 1 then do until ( eof ) ; 170 set acctlist end = eof ; 171 k = input (hin, 15.) ; 172 do h = mod (k, &h) by 1 until ( hk(h) = k ) ; 173 if h = &h then h = 0 ; 174 if hk (h) = . then hk(h) = k ; 175 end ; 176 end ; 177 infile large ; 178 input @1 k 15. @ ; 179 do h = mod (k, &h) by 1 until ( hk(h) = k | hk(h) = . ) ; 180 if h = &h then h = 0 ; 181 end ; 182 file log ; *replace with your outfile as needed ; 183 if hk(h) = k then put _infile_ ; 184 run ;

000000000000001 1 000000000000003 3 000000000000004 4 000000000000009 9 000000000000010 10 000000000000014 14

NOTE: 15 records were read from the infile LARGE. NOTE: There were 6 observations read from the data set WORK.ACCTLIST.

The relative speed of this singel-step solution could be more visible if the input were a bit more sizeable, say,

231 filename large "h:\sastest.txt" ; 232 data acctlist ; 233 file large ; 234 do _n_ = 1 to 1.5e6 ; 235 put @01 _n_ z15. @17 _n_ 15. ; 236 hin = put (_n_ , z15.) ; 237 if ranuni(1) < .1 then output ; 238 end ; 239 run ;

NOTE: 1500000 records were written to the file LARGE. NOTE: The data set WORK.ACCTLIST has 150109 observations and 1 variables.

Then the log would indicate (of course, here I am NOT writing the output in the log, but to a separately allocated external file tied to the fileref OUT) that overall, the simplest hash table outperforms the format 2:1, although 5 to 11 seconds difference may not be enough to resort to a more (externally) complex and less traditional scheme (at least before 9.1, where it becomes implemented internally and standard).

240 data cntlin / view = cntlin ; 241 retain fmtname 'search' type 'c' label '+' ; 242 do until ( eof ) ; 243 set acctlist ( rename = (hin=start) ) ; 244 output ; 245 end ; 246 hlo = 'O' ; *this is letter O, not zero ; 247 label = '-' ; 248 output ; 249 stop ; 250 run ;

NOTE: DATA STEP view saved on file WORK.CNTLIN.

251 proc format cntlin = cntlin ; NOTE: Variable eof is uninitialized. NOTE: Format $SEARCH has been output. 252 run ;

NOTE: View WORK.CNTLIN.VIEW used (Total process time): real time 2.60 seconds cpu time 2.59 seconds

NOTE: PROCEDURE FORMAT used (Total process time): real time 2.62 seconds cpu time 2.60 seconds

NOTE: There were 150109 observations read from the data set WORK.ACCTLIST. NOTE: There were 150109 observations read from the data set WORK.CNTLIN.

253 filename out "h:\sastest.out" ; 254 data _null_ ; 255 infile large ; 256 input @ 001 hin $char15. @ ; 257 file out ; 258 if put (hin, $search.) = '+' then put _infile_ ; 259 run ;

NOTE: The infile LARGE is: File Name=h:\sastest.txt, RECFM=V,LRECL=256

NOTE: The file OUT is: File Name=h:\sastest.out, RECFM=V,LRECL=256

NOTE: 1500000 records were read from the infile LARGE. NOTE: 150109 records were written to the file OUT. NOTE: DATA statement used (Total process time): real time 5.81 seconds cpu time 5.65 seconds

260 261 %let h = 300001 ; 262 263 data _null_ ; 264 array hk ( 0 : &h ) _temporary_ ; 265 if _n_ = 1 then do until ( eof ) ; 266 set acctlist end = eof ; 267 k = input (hin, 15.) ; 268 do h = mod (k, &h) by 1 until ( hk(h) = k ) ; 269 if h = &h then h = 0 ; 270 if hk (h) = . then hk(h) = k ; 271 end ; 272 end ; 273 infile large ; 274 input @1 k 15. @ ; 275 do h = mod (k, &h) by 1 until ( hk(h) = k | hk(h) = . ) ; 276 if h = &h then h = 0 ; 277 end ; 278 file out ; 279 if hk(h) = k then put _infile_ ; 280 run ;

NOTE: 1500000 records were read from the infile LARGE. NOTE: 150109 records were written to the file OUT. NOTE: There were 150109 observations read from the data set WORK.ACCTLIST. NOTE: DATA statement used (Total process time): real time 4.99 seconds cpu time 4.79 seconds

If you increase the input size 10 times over, the performance difference is likely to become quite staggering, not to mention that under agerage circumstances, FORMAT will probably prove to be way too memory-hungry. However, remember that in this case, the small file will contain about 1.5e6 keys, so the value assigned to the macrovariable H should be the smallest prime number greater than approximately twice 1.5e6 (+- 0.5*e6 will not make much of a difference as long as &H is prime).

Kind regards, ================== Paul M. Dorfman Jacksonville, FL ==================

>From: PD <sophe88@YAHOO.COM> > >Hi, Paul, > >Below is the code I sumbit to my Os/390 sas 8.1 today. > >' DATA CNTLIN /VIEW=CNTLIN; > RETAIN FMTNAME 'SEARCH' TYPE 'C' LABEL '+'; > DO UNTIL ( EOF ); SET TWO(RENAME=(HIN=START)); > OUTPUT; END; HLO='O'; LABEL='-'; OUTPUT; STOP; RUN; > PROC FORMAT CNTLIN=CNTLIN; RUN; > DATA _NULL_; FILE CIA; INFILE INPT; > INPUT @16 KEY $CHAR15. @; > IF PUT (HIN, SEARCH.) = '+' THEN PUT _INFILE_; RUN; ' > >where Hin is a 15 byte character and key is 15 byte on the infile. > >I ran into two problems. 1. SAS said variable EOF not initialized. >2. It says format search can not be loaded. > >In JCL, I set up CIA so that it is written onto a tape. I can switch >it to cyc if necessary, but I don't think this is the problem. Also, >neither HIN nor KEY has any character. They are all like >000000000123456. > >The View cntlin has been successfully created. Your help is >appreciated. > >PD > >paul_dorfman@HOTMAIL.COM (Paul Dorfman) wrote in message >news:<BAY2-F57SzbAUfKRfcj0000010a@hotmail.com>... > > PD, > > > > All you need is a decent algorithm to store the 150K keys in a > > memory-resident table and then search the table for each of the 42M >records > > read from the tape, discarding those whose keys are not found in the >table. > > The format you have mentioned stores the keys in an AVL tree, so looking > > them up should be quick, and since a 150K table is rather miniscule to > > handle even for Proc FORMAT (well-known for its memorial voracity), it > > should not spawn out-of-memory horror stories, either. > > > > Suppose your large file is attached to a filename LARGE, and the 150K >SAS > > data file called ACCTLIST contains the single 16-byte variable ACCTNO. >Also > > suggest that in the large file, the account number occupies the leftmost >16 > > bytes. Then you might proceed as follows: > > > > data cntlin / view = cntlin ; > > retain fmtname 'search' type 'c' label '+' ; > > do until ( eof ) ; > > set acctlist ( rename = (acctno=start) ) ; > > output ; > > end ; > > hlo = 'O' ; *this is letter O, not zero ; > > label = '-' ; > > output ; > > stop ; > > run ; > > > > proc format cntlin = cntlin ; > > run ; > > > > data _null_ ; > > infile large ; > > input @ 001 acctno $char16. @ ; > > if put (acctno, search.) = '+' then put _infile_ ; > > run ; > > > > That is likely to provide all the performance you need, because the >process > > is I/O bound, and an increase in the speed of the table look-up will be > > offset by the overhead of reading the tape. However, if your key is >actually > > a digit string (i.e. is composed solely of [decimal] digits), and the >desire > > for better CPU performance is overwhelmingly powerful, another, >one-step, > > solution could be > > > > %let h = 300001 ; > > > > data _null_ ; > > array hk ( 0 : &h ) _temporary_ ; > > if _n_ = 1 then do until ( eof ) ; > > set acctlist end = eof ; > > do h = mod (acctno, &h) by 1 until ( hk(h) = acctno ) ; > > if h = &h then h = 0 ; > > if hk (h) = . then hk(h) =acctno ; > > end ; > > end ; > > infile large ; > > input @1 acctno 16. @ ; > > do h = mod (acctno, &h) by 1 until ( hk(h) = acctno | hk(h) = . ) ; > > if h = &h then h = 0 ; > > end ; > > if hk(h) = acctno then put _infile_ ; > > run ; > > > > Please note that the code is strictly off the top of a sick head, but >there > > is a possibility it will work. HTH. > > > > Kind regards, > > ------------------------------- > > Paul M. Dorfman > > Jacksonville, FL > > ------------------------------- > > > > >From: PD <sophe88@YAHOO.COM> > > >Reply-To: PD <sophe88@YAHOO.COM> > > >To: SAS-L@LISTSERV.UGA.EDU > > >Subject: Cut a part out of a big flat fixed length text file > > >Date: Wed, 25 Jun 2003 21:00:23 -0700 > > > > > >I have a big fixed length text file with about 2000 vars. The primary > > >key is a 16-character var. There are about 42 millions obs in the > > >file. It sits on a OS390 tape image. (can not browse it, but > > >reading/sharing is fine). I have another SAS dataset that has about > > >150k of this 16-charter var. > > > > > >I need to preseve the layout of the text file, but only want to > > >records for the 150k people. I have the layout for the text file, but > > >do not want to write them into a SAS data set, afraid the layout may > > >not be preseved. > > > > > >I used to run a JCL based Syncsort program that allows me to > > >match-merge two text files using a common key without transferring > > >them to SAS files. Recently the program is not functioning properly. > > >So I am resorting to SAS again. > > > > > >Heard about CNTln option in proc format. Don't know how relevant that > > >is to my problem. > > > > > >Thank you. > > >PD > > > > _________________________________________________________________ > > MSN 8 with e-mail virus protection service: 2 months FREE* > > http://join.msn.com/?page=features/virus

_________________________________________________________________ Protect your PC - get McAfee.com VirusScan Online http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963


Back to: Top of message | Previous page | Main SAS-L page