|Date: ||Wed, 15 Jun 2005 18:21:18 -0400|
|Reply-To: ||Dwyer Ted <DWYERT@pcsb.org>|
|Sender: ||"SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>|
|From: ||Dwyer Ted <DWYERT@pcsb.org>|
|Subject: ||Re: embedded codes in my Data problems|
|Content-Type: ||text/plain; charset="iso-8859-1"|
Thank you, your solution (Metapad) worked.
The "official counts" and the resulting counts after I opened the file in Metapad and saved it were consistent. I will be scrutinizing the data more closely tomorrow, but the program seems to have successfully stripped the offending characters.
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Marta García-Granero
Sent: Wednesday, June 15, 2005 11:03 AM
Subject: Re: embedded codes in my Data problems
I had a somewhat related problem a long time ago (exporting Amiga
documents to a PC with Windows). A lot of control codes (ASCII values
under 32) were embedded in the text documents. We wrote a tiny BASIC
program that read the files sequentially and replaced any byte under
32 with a " " (ASCII code 32). Then we were able to read the text
files with Word (and add all the lost formatting again). I have lost
that program, but I don't think it will be difficult to write. I
wasn't an expert then (nor am I now), but it took me less than half an
hour (with a GWBASIC manual at hand) to write it. In pseudocode, it
went more or less like this:
- Ask for the input filename & the output filename
- OPEN first filename as INPUT and 2nd as OUTPUT
- WHILE not EOF
  - READ a byte from the first file
  - IF the value was under 32, replace it by 32 (a blank)
  - WRITE the byte in 2nd file
- CLOSE both files
- END program.
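
For anyone without a BASIC interpreter at hand, the steps above can be
sketched in Python. This is only a rough sketch, not the original
program: the function names, the chunk size and the `keep` parameter
are my own assumptions. One deliberate difference from the pseudocode:
by default it leaves tab, carriage return and line feed (codes 9, 13
and 10) alone, since blanking those would merge every record of a data
file into one long line. Pass keep=b"" to get the exact behaviour of
the pseudocode.

```python
def build_table(keep=b"\t\r\n"):
    """Build a 256-entry table mapping control bytes (< 32) to a blank.

    Bytes listed in `keep` (an assumption of this sketch, not part of
    the original pseudocode) are left unchanged.
    """
    table = bytearray(range(256))
    for code in range(32):
        if code not in keep:
            table[code] = 0x20  # ASCII 32, a blank
    return bytes(table)


def strip_control_codes(src_path, dst_path, keep=b"\t\r\n", chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing control bytes with blanks.

    The file is read in binary mode and in chunks, so file size is
    limited by disk space rather than by memory.
    """
    table = build_table(keep)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            dst.write(chunk.translate(table))
```

It would be called as strip_control_codes("raw.dat", "clean.dat"),
where both filenames are placeholders. Opening the files in binary
mode matters: a stray byte 26 (Ctrl-Z) has traditionally been treated
as an end-of-file marker by DOS/Windows text-mode readers, which may
be part of what is happening here, and binary mode reads straight past
it.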
You can also try METAPAD. It's a sort of Notepad program, but it's
able to read larger files, and it automatically eliminates codes it
can't translate to characters (it issues a warning about non-readable
characters and nulls).
It can be downloaded from: http://liquidninja.com/metapad/
DT> I have multiple large data files sometimes with millions of records but
DT> usually with only about 100K+ or so.
DT> Sometimes (and with a recent alarming increase) they have command codes
DT> embedded that SPSS sees as end of file or end of record commands.
DT> When I look at the file with a text editor I can see nothing.
DT> When I look with a hex editor there are a variety of different codes.
DT> The only way that I can go through and eliminate the codes is using the
DT> hex editor which is a painstaking process which I would like to avoid.
DT> Does anyone know a method of stripping out embedded codes?
DT> The files are too big for Excel. (Excel has the CLEAN function, which has
DT> worked in the past but only for the smaller of my datasets.)
DT> Access has trouble with the files as well.