Date: Fri, 23 Jul 2004 17:55:48 +0200
Reply-To: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Subject: Re: Binary Data in Raw Data Inputs
Content-Type: text/plain; charset="iso-8859-1"
Hi Mark,
Characters with ASCII values 128-255 are not nonprintable. They print perfectly well, but they map to different characters in different character sets. See http://www.asciitable.com/ for the "extended" ASCII characters in the PC-8 character set.
Regards - Jim.
--
. . . . . . . . . . . . . . . .
Jim Groeneveld, MSc.
Biostatistician
Science Team
Vitatron B.V.
Meander 1051
6825 MJ Arnhem
Tel: +31/0 26 376 7365
Fax: +31/0 26 376 7305
Jim.Groeneveld@Vitatron.com
www.vitatron.com
My statistics are quite predictable, but my computer may be quite unpredictable.
[common disclaimer]
-----Original Message-----
From: Terjeson, Mark [mailto:TERJEM@DSHS.WA.GOV]
Sent: Friday, July 23, 2004 17:29
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Binary Data in Raw Data Inputs
PS: a FILENAME statement below helps.
Also, control characters are 0-31,
space is 32, printable characters are 32-126,
delete/rubout is 127, and the other nonprintable
characters are 128-255 (usually used for PC symbols).
-----Original Message-----
From: Terjeson, Mark [mailto:TERJEM@DSHS.WA.GOV]
Sent: Friday, July 23, 2004 8:18 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Binary Data in Raw Data Inputs
Hi,
As everyone has mentioned, the hex=1A decimal=26 char=^Z character
can be handled in a couple of the ways already described. A hex
viewer makes it easy to see the nonprintable character values
and where they are. One example is UltraEdit: just hit Ctrl-H
and it toggles your text into hex representation,
which is really handy.
If you are stuck without an editor to see the byte values,
either in decimal or hex values, you can always do it in SAS!
You can tweak programs such as these to stream through a file
and remove bad characters, or change certain characters, or
even add characters, etc.
Some folks have heard the terms printable and nonprintable
characters, but what are they? A byte can contain the decimal
values 0-255. Each of these 256 values has been assigned a
letter, number, symbol, or control-code meaning. The same value
can also be written in different ways: the value 15 is 15 in
decimal, 0F in hexadecimal (hex), and 17 in octal, and it has
the meaning Ctrl-O, or ...
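A minimal sketch of the same idea in SAS, assuming only the standard HEXw., OCTALw., and BINARYw. formats: it writes the value 15 to the log in several bases, which should agree with the representations quoted above. The variable name x is just for illustration.

```sas
* one value, several representations ;
data _null_;
  x = 15;
  put 'decimal: ' x 3.;
  put 'hex:     ' x hex2.;
  put 'octal:   ' x octal2.;
  put 'binary:  ' x binary8.;
run;
```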
If you want to check out more on what a byte is, or what
different number bases are all about you can check out:
http://listserv.uga.edu/cgi-bin/wa?A2=ind0107C&L=sas-l&P=R39591
There are two datasteps here as examples: one creates a text
file with only printable characters and one creates a file with
the full spectrum of printable and nonprintable characters. There
are also a couple of datasteps that read the text file in, one byte
at a time; you can then send the bytes to a file or to the log as
characters, decimal values, or hex values to investigate them
yourself. These are samples for small file sizes, but you can
expand on these suggestions to handle large files as well. If you
are looking for certain things, you can write additional logic to
find them, change them, delete them, etc.
filename flatfile 'C:\temp\flatfile.txt';

* create sample data ;
* printable and nonprintable characters ;
data _null_;
  length c $128;
  file flatfile;
  * bytes 0-25 fill columns 1-26 ;
  c = collate(0,25);
  put @1 c @;
  * on PCs the 1A (26) ;
  * is an EOF marker ;
  * so have to skip it ;
  * bytes 27-127 fill columns 27-127 ;
  c = collate(27,127);
  put @27 c @;
  c = collate(128,255);
  put @128 c;
run;

* create sample data ;
* printable characters only ;
* (overwrites the file created above) ;
data _null_;
  file flatfile;
  put 'hello';
  put 'goodbye';
run;

* read file one byte at a time ;
data pchar;
  length pchar $1;
  infile flatfile lrecl=1000;
  input pchar $1. @@;
run;

* read file one byte at a time ;
* and show each byte in the log ;
data _null_;
  length c $ 1;
  fnrc=filename('foo','c:\temp\flatfile.txt');
  fid=fopen('foo');
  do while (fread(fid) eq 0);
    recnum+1;
    do i=1 to frlen(fid);
      fgrc=fget(fid,c,1);
      put 'just read byte ' i 'of record ' recnum
          'and now ' c= $hex2. 'hex ' c=;
    end;
  end;
  * close the file and release the fileref ;
  fcrc=fclose(fid);
  fnrc=filename('foo');
run;

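Following the "remove bad characters" suggestion above, a byte-for-byte copy that drops every hex 1A might look like the sketch below. This is an untested sketch rather than a definitive fix: the filerefs in and out and the output path are hypothetical, and it assumes that opening the files with fopen in binary mode ('B', record length 1) lets fread continue past a 1A instead of stopping there, which is worth checking on your own system.

```sas
filename in  'C:\temp\flatfile.txt';
filename out 'C:\temp\flatfile_clean.txt';  * hypothetical output path ;

* copy the file one byte at a time, skipping hex 1A ;
data _null_;
  length rec $ 1;
  fin  = fopen('in',  'I', 1, 'B');
  fout = fopen('out', 'O', 1, 'B');
  if fin > 0 and fout > 0 then
    do while (fread(fin) eq 0);
      rc = fget(fin, rec, 1);
      if rec ne '1A'x then do;
        rc = fput(fout, rec);
        rc = fwrite(fout);
      end;
    end;
  if fin  > 0 then rc = fclose(fin);
  if fout > 0 then rc = fclose(fout);
run;
```

The test on rec can be widened to drop or replace other byte values in the same pass.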
Hope this is helpful,
Mark Terjeson
Reporting, Analysis, and Procurement Section
Information Services Division
Department of Social and Health Services
State of Washington
mailto:terjem@dshs.wa.gov
-----Original Message-----
From: Groeneveld, Jim [mailto:jim.groeneveld@VITATRON.COM]
Sent: Friday, July 23, 2004 1:05 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Binary Data in Raw Data Inputs
Hi Paul [C],
Well, actually just the hex 1A (byte(26)) would suffice. But it might be
worthwhile to know which control character(s) Matt actually has in his
data. Matt, could you search for some of them with a hex lister?
Regards - Jim.
-----Original Message-----
From: Choate, Paul@DDS [mailto:pchoate@DDS.CA.GOV]
Sent: Friday, July 23, 2004 00:23
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Binary Data in Raw Data Inputs
Matt -
There may be a DOS end-of-file mark in your data. SAS reads HEX 1A 0D as a
DOS end of file mark. This is documented in a SAS note at Support.SAS.COM.
---------------------------------------------------------------------------
options IgnoreDOSEOF;
SN-003632
When reading a binary file as text, the SAS System stops reading the input
file after encountering a Ctrl + Z character
----------------------------------------------------------------------------
If the SAS System encounters a Ctrl + Z or Hex 1a character when reading
a binary file as text, input stops as the character is treated as an end
of file character. There is a new option for Version 8.2, IgnoreDOSEOF,
which will allow these characters to be read.
---------------------------------------------------------------------------
Before I knew what was going on, I got around it with a hex editor,
or I read through the file with SAS one byte at a time and fixed it.
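For anyone on Version 8.2 or later, here is a minimal sketch of how the option from the note might be applied. To my knowledge, IGNOREDOSEOF is given as a host option on the INFILE (or FILENAME) statement under Windows. The dataset name weblog, the variable line and its length, and the file path are placeholders, and the COMPRESS call with the 'kw' (keep writable) modifiers assumes SAS 9.

```sas
* read past Ctrl-Z and strip nonprintable bytes ;
data weblog;
  infile 'C:\temp\flatfile.txt' ignoredoseof truncover lrecl=32767;
  length line $ 1000;
  input line $char1000.;
  * keep only printable characters ;
  line = compress(line, , 'kw');
run;
```

Reading each record into one long variable first, then parsing the fields from it, keeps a stray byte from derailing the rest of the INPUT statement.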
Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Matt Pettis
Sent: Thursday, July 22, 2004 2:10 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Binary Data in Raw Data Inputs
Hi,
I am trying to read in data from IIS weblogs that *should* be just ASCII
data. However, occasionally I get fields that contain non-ASCII
characters. This is confirmed by viewing the raw log in an editor and
seeing non-displayable characters (as boxes). I believe that these
characters are causing my datastep to stop and not process further lines.
These lines are rare, so I do not care if I lose such a record, but I do care
that I lose all of the records after it. Does anybody have any ideas on
how to handle lines like these so that the datastep can continue past them?
Thanks in advance for any ideas,
Matt Pettis