Date: Sat, 9 Apr 2011 18:06:47 -0400
Reply-To: Arthur Tabachneck <art297@ROGERS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@ROGERS.COM>
Subject: Re: Trouble reading a very large ASCII file perhaps due to
'0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
Matthew,
If your guess is correct, simply add the option
ignoredoseof
to your infile statement.
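
A minimal sketch of what that might look like, assuming the file really does contain a stray Control-Z ('1a'x). The path, the data set name WANT, and the two INPUT fields are placeholders for illustration only (the field layout is borrowed from the example further down the thread); IGNOREDOSEOF is a Windows-host INFILE option that treats Control-Z as ordinary data rather than as an end-of-file marker.

```sas
/* Sketch only: the path and the INPUT layout are placeholders. */
data want;
   /* IGNOREDOSEOF: read '1a'x (Control-Z) as data, not as end of file */
   infile 'c:\yourdata\yourhugefile.txt' lrecl=651 truncover ignoredoseof;
   input @1  var1 $char20.
         @35 var2 $char13.;
run;
```

If the NOTE in the log then reports roughly 6 million records read, the Control-Z guess was probably right.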
Art
------
On Sat, 9 Apr 2011 21:36:39 +0000, Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)
<mmz1@CDC.GOV> wrote:
>Thank you.
>
>The end-of-file marker (Control-Z) within an ASCII field seems the most likely possibility now.
>I'll have to read up on the ENCODING option if the other solutions that SAS-Lers have suggested
>do not work.
>
>Matthew Zack
>
>
>-----Original Message-----
>From: Lingqun [mailto:lingqun@gmail.com]
>Sent: Saturday, April 09, 2011 2:52 PM
>To: Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)
>Cc: SAS-L@LISTSERV.UGA.EDU
>Subject: Re: Trouble reading a very large ASCII file perhaps due to '0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
>
>You may try option ENCODING=
>
> On Apr 9, 2011, at 12:49 PM, "Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)" <mmz1@CDC.GOV> wrote:
>
>> Thank you for your suggestion.
>>
>> I'll try it out.
>>
>> Matthew Zack
>>
>> From: Gabriel Rosas [mailto:rosas.gabe@gmail.com]
>> Sent: Saturday, April 09, 2011 11:47 AM
>> To: Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)
>> Subject: Re: Trouble reading a very large ASCII file perhaps due to '0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
>>
>> I think you're going to have to read it in byte by byte and re-write the text file before reading it in properly.
>> The following is untested code.
>>
>> filename fixfile temp;
>>
>> data _null_;
>> infile yourhugefile recfm=n lrecl=651;
>> file fixfile lrecl=651;
>> recpos=1;
>> do while(recpos<652);
>> input chktmp $char1. @; /* $char1. keeps blanks as data */
>> if chktmp='0d'x then do;
>> input chktmp $char1. @;
>> if chktmp='0a'x then recpos+2; /* hex literal takes an x suffix, not d */
>> end;
>> else put chktmp $char1. @;
>> recpos+1;
>> end;
>> put;
>> run;
>>
>>
>> On Sat, Apr 9, 2011 at 10:24 AM, Zack, Matthew M. (CDC/ONDIEH/NCCDPHP) <mmz1@cdc.gov> wrote:
>> This text file is ~ 3.9 GB long and is being read using a SAS DATA step with INFILE/INPUT statements
>> under Windows XP. The record length is 651, and only some of the variables/fields/columns on each record
>> are being read. One of the records has a carriage-return+line-feed in the middle of one of these variables
>> so that SAS stops reading and writing observations at that record (N=580,376). This record shows up in the
>> incomplete SAS data set using the SAS Analyst as being truncated within this specific variable; all preceding
>> variables with this record look OK, and all succeeding variables within this record are blank.
>>
>> Given the size of the file and the record length, the total number of records on the file should be closer
>> to 6,000,000 (ten times the number I can read in). I don't have a file viewer/text editor with hex capabilities
>> that can "see" if other problems are affecting the records beyond record # 580,376.
>>
>> I've tried the following combinations of INFILE/INPUT statement options without successfully reading
>> or writing these 6 million records (the NOTE to the SAS LOG indicates that only 580,376 records have
>> been read and written):
>>
>> 1. INFILE options LRECL=651, PAD, TRUNCOVER, and MISSOVER:
>>
>> INFILE filename LRECL=651 PAD TRUNCOVER;
>> INPUT . . . ;
>>
>> or
>>
>> INFILE filename LRECL=651 PAD MISSOVER;
>> INPUT . . . ;
>>
>> 2. INFILE option LENGTH=xxx with two INPUT statements, one of which has a $VARYINGW. informat:
>>
>> LENGTH LINE $ 651;
>> INFILE filename LENGTH=linelen;
>> INPUT @;
>> INPUT @1 LINE $VARYING651. LINELEN;
>> . . . subsequent statements to parse the variable, LINE, into distinct variables/fields. . .;
>>
>> 3. Removing the carriage-return + line feed:
>>
>> LENGTH LINE LINE2 $ 651;
>> INFILE filename LRECL=651 PAD TRUNCOVER;
>> INPUT @1 LINE $CHAR651.;
>> LINE2=COMPRESS(LINE,'0d0a'x);
>> . . . subsequent statements to parse the variable, LINE2, into distinct variables/fields. . .;
>>
>> 4. Using the INFILE statement options, FIRSTOBS=nnnn and OBS=nnnnn, to read past the troublesome record,
>> perhaps with two separate DATA steps to read records before and after this record:
>>
>> DATA TEMP1;
>> INFILE filename FIRSTOBS=1 OBS=580375 LRECL=651 PAD TRUNCOVER;
>> INPUT . . . .;
>> OUTPUT TEMP1;
>> RUN;
>>
>> DATA TEMP2;
>> INFILE filename FIRSTOBS=580377 LRECL=651 PAD TRUNCOVER;
>> INPUT . . . .;
>> OUTPUT TEMP2;
>> RUN;
>>
>> PROC APPEND DATA=TEMP2 BASE=TEMP1;
>> RUN;
>>
>> PROC DATASETS LIBRARY=WORK NOLIST;
>> DELETE TEMP2 / MEMTYPE=DATA;
>> QUIT;
>>
>>
>> 5. Reading only variables in text column positions before the variable truncated by the Carriage-Return
>> and Line-Feed (for example, VAR8 starting in column 230) on record number 580,376:
>>
>> DATA TEMP1;
>> INFILE filename FIRSTOBS=1 OBS=580375 LRECL=651 PAD TRUNCOVER;
>> INPUT @1 var1 $char20. @35 var2 $char13. . . . . . . var7 218-223;
>> OUTPUT TEMP1;
>> RUN;
>>
>> Because none of these attempted solutions reads beyond the truncated record number 580,376, 90% of the records
>> are missing from the final SAS data set.
>>
>> Could this be a problem with Windows XP (address space limitations) or SAS version 9.13?
>>
>> Any other ideas for a solution?
>>
>> Thank you.
>>
>> Matthew Zack