Date: Sat, 9 Apr 2011 21:36:39 +0000
Reply-To: "Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)" <mmz1@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)" <mmz1@CDC.GOV>
Subject: Re: Trouble reading a very large ASCII file perhaps due to
'0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
In-Reply-To: <7F795C38-7369-4BF8-890B-71D019CD72C1@gmail.com>
Content-Type: text/plain; charset="iso-2022-jp"
Thank you.
An end-of-file marker (Control-Z, '1A'x) embedded in an ASCII field now seems the most likely cause.
I'll have to read up on the ENCODING option if the other solutions that SAS-Lers have suggested
do not work.
Matthew Zack
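
A minimal sketch of how the Control-Z hypothesis could be checked, assuming SAS under Windows and that the IGNOREDOSEOF option of the INFILE statement is available in this release. The fileref RAWFILE, its path, and the counters NREC and NCTRLZ are placeholders, not names from the thread:

```sas
/* Placeholder path: point this at the real 3.9 GB text file. */
filename rawfile 'C:\data\yourhugefile.txt';

data _null_;
  /* IGNOREDOSEOF asks SAS to treat '1A'x (Control-Z) as ordinary data */
  /* rather than as an end-of-file marker.                             */
  infile rawfile lrecl=651 pad truncover ignoredoseof end=last;
  input @1 line $char651.;
  nrec + 1;
  /* Report each record that carries an embedded Control-Z. */
  if index(line, '1A'x) > 0 then do;
    nctrlz + 1;
    putlog 'Control-Z found in record ' nrec;
  end;
  if last then putlog 'Records read: ' nrec ' Records with Control-Z: ' nctrlz;
run;
```

If the final count in the log comes out near the expected 6,000,000 records, that would be consistent with a stray Control-Z having ended the earlier reads; if it still stops at 580,376, another cause is more likely.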
-----Original Message-----
From: Lingqun [mailto:lingqun@gmail.com]
Sent: Saturday, April 09, 2011 2:52 PM
To: Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)
Cc: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Trouble reading a very large ASCII file perhaps due to '0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
You may try the ENCODING= option.
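
A sketch of where ENCODING= could go, assuming it is given on the INFILE statement. The value 'wlatin1', the fileref RAWFILE, and its path are illustrative guesses only, since the thread does not say how the file is encoded:

```sas
/* Placeholder path: point this at the real text file. */
filename rawfile 'C:\data\yourhugefile.txt';

data temp1;
  /* ENCODING= tells SAS how to interpret the bytes of the external file. */
  /* 'wlatin1' (Windows Latin-1) is an assumed value, not a known fact.   */
  infile rawfile encoding='wlatin1' lrecl=651 pad truncover;
  input @1 line $char651.;
run;
```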
On Apr 9, 2011, at 12:49 PM, "Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)" <mmz1@CDC.GOV> wrote:
> Thank you for your suggestion.
>
> I'll try it out.
>
> Matthew Zack
>
> From: Gabriel Rosas [mailto:rosas.gabe@gmail.com]
> Sent: Saturday, April 09, 2011 11:47 AM
> To: Zack, Matthew M. (CDC/ONDIEH/NCCDPHP)
> Subject: Re: Trouble reading a very large ASCII file perhaps due to '0d0a'x (Carriage-Return + Line-Feed) within variable: SAS v 9.13
>
> I think you're going to have to read it in byte by byte and re-write the text file before reading it in properly. The following is untested code.
>
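> filename fixfile temp;
>
> data _null_;
>   /* recfm=n reads the file as a byte stream with no record boundaries */
>   infile yourhugefile recfm=n lrecl=651;
>   file fixfile lrecl=651;
>   recpos=1;
>   /* copy 651 data bytes per output record, dropping every CR and LF; */
>   /* this assumes each record holds exactly 651 data bytes            */
>   do while(recpos<652);
>     input chktmp $char1. @;
>     if chktmp not in ('0d'x,'0a'x) then do;
>       put chktmp $char1. @;
>       recpos+1;
>     end;
>   end;
>   put;  /* end the rebuilt record */
> run;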
>
>
> On Sat, Apr 9, 2011 at 10:24 AM, Zack, Matthew M. (CDC/ONDIEH/NCCDPHP) <mmz1@cdc.gov> wrote:
> This text file is ~ 3.9 GB long and is being read using a SAS DATA step with INFILE/INPUT statements
> under Windows XP. The record length is 651, and only some of the variables/fields/columns on each record
> are being read. One of the records has a carriage-return+line-feed in the middle of one of these variables
> so that SAS stops reading and writing observations at that record (N=580,376). Viewed in SAS Analyst,
> this record appears in the incomplete SAS data set truncated within this specific variable; all preceding
> variables within this record look OK, and all succeeding variables within this record are blank.
>
> Given the size of the file and the record length, the total number of records on the file should be closer
> to 6,000,000 (ten times the number I can read in). I don't have a file viewer/text editor with hex capabilities
> that can "see" if other problems are affecting the records beyond record # 580,376.
>
> I've tried the following combinations of INFILE/INPUT statement options without successfully reading
> or writing these 6 million records (the NOTE to the SAS LOG indicates that only 580,376 records have
> been read and written):
>
> 1. INFILE options LRECL=651, PAD, TRUNCOVER, and MISSOVER:
>
> INFILE filename LRECL=651 PAD TRUNCOVER;
> INPUT . . . ;
>
> or
>
> INFILE filename LRECL=651 PAD MISSOVER;
> INPUT . . . ;
>
> 2. INFILE option LENGTH=xxx with two INPUT statements, one of which has a $VARYINGW. informat:
>
> LENGTH LINE $ 651;
> INFILE filename LENGTH=linelen;
> INPUT @;
> INPUT @1 LINE $VARYING651. LINELEN;
> . . . subsequent statements to parse the variable, LINE, into distinct variables/fields. . .;
>
> 3. Removing the carriage-return + line feed:
>
> LENGTH LINE LINE2 $ 651;
> INFILE filename LRECL=651 PAD TRUNCOVER;
> INPUT @1 LINE $CHAR651.;
> LINE2=COMPRESS(LINE,'0d0a'x);
> . . . subsequent statements to parse the variable, LINE2, into distinct variables/fields. . .;
>
> 4. Using the INFILE statement options, FIRSTOBS=nnnn and OBS=nnnnn, to read past the troublesome record,
> perhaps with two separate DATA steps to read records before and after this record:
>
> DATA TEMP1;
> INFILE filename FIRSTOBS=1 OBS=580375 LRECL=651 PAD TRUNCOVER;
> INPUT . . . .;
> OUTPUT TEMP1;
> RUN;
>
> DATA TEMP2;
> INFILE filename FIRSTOBS=580377 LRECL=651 PAD TRUNCOVER;
> INPUT . . . .;
> OUTPUT TEMP2;
> RUN;
>
> PROC APPEND DATA=TEMP2 BASE=TEMP1;
> RUN;
>
> PROC DATASETS LIBRARY=WORK NOLIST;
> DELETE TEMP2 / MEMTYPE=DATA;
> QUIT;
>
>
> 5. Reading only variables in text column positions before the variable truncated by the Carriage-Return
> and Line-Feed (for example, VAR8 starting in column 230) on record number 580,376:
>
> DATA TEMP1;
> INFILE filename FIRSTOBS=1 OBS=580375 LRECL=651 PAD TRUNCOVER;
> INPUT @1 var1 $char20. @35 var2 $char13. . . . . . . var7 218-223;
> OUTPUT TEMP1;
> RUN;
>
> Because none of these attempted solutions reads beyond the truncated record number 580,376, 90% of the records
> are missing from the final SAS data set.
>
> Could this be a problem with Windows XP (address space limitations) or SAS version 9.13?
>
> Any other ideas for a solution?
>
> Thank you.
>
> Matthew Zack