Date: Thu, 18 Jul 2002 11:06:04 -0400
Reply-To: "Delaney, Kevin P." <khd8@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Delaney, Kevin P." <khd8@CDC.GOV>
Subject: Reading messy Genomic data
Content-Type: text/plain; charset="iso-8859-1"
Let's see if I can describe my problem adequately...
I have received raw lab data (is there anything uglier :-)) from a group
charged with sequencing about 1500 bases for us...
The data should be of the form: a 15-character identifier (e.g. UBA123456789123)
followed by the 1500 or so bases. There are two problems:
1) the file they sent is one continuous record with no delimiters, but
with several unprintable characters mixed in with no obvious pattern
2) the number of bases sequenced is not constant, so I can't say
   Input Id: $15. Sequence: $1500. @@;
What I have done is read in the whole file (luckily small, ~200KB),
compress out the special characters, then reread the result one character at
a time and write a delimited file, placing a delimiter before each U (luckily
these are DNA sequences, so a U can only be the start of an identifier)...
Then I read my delimited file back in...
My current code looks something like this...
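DATA work1; *(keep=fixseq flag);
  * rawfile is a placeholder fileref for the lab file;
  infile rawfile truncover end=eof lrecl=1048576 length=varlength;
  input @1 sequence $2000. @;
  do i=0 to varlength by 2000;
    input @1 +i sequence $2000. @;
    * strip characters 0 through 36 from each 2000-byte chunk;
    fixseq=compress(translate(sequence," ",collate(0,36)));
    *if indexc(trim(left(fixseq)),"U")>0 then flag=1;
    output;
  end;
run;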
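filename junk "&path.delimitedfile.txt";
data _null_;
  set work1 end=eof;
  length a $1;
  file junk lrecl=300000;
  * walk fixseq one character at a time until the first blank;
  do i=1 by 1 until(a=" ");
    a=substr(fixseq,i,1);
    * a U starts a new identifier, so write the delimiter before it;
    if a="U" then do;
      put "," a +(-1) @;
    end;
    else put a +(-1) @;
  end;
run;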
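data work2(keep=id bases);
  infile junk dsd dlm="," lrecl=1048576 length=varlength;
  length sequence $2000 id $15 bases $2000;
  input sequence : $2000. @@;
  * the first 15 characters are the identifier and the rest are the bases;
  id=substr(sequence,1,15);
  bases=substr(sequence,16);
run;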
This works OK for this small file, but I am thinking there has to be a better
way (one pass?? Maybe an array to hold values up to and after a 'U'BA id).
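A rough, untested sketch of what the one-pass idea might look like, resting on
several assumptions: rawfile.txt is a made-up name for the lab file, the junk
characters are exactly the ones collate(0,36) covers, a U never occurs anywhere
except at the start of an identifier, and no identifier plus its bases runs
past 2015 characters. It reads the file as a byte stream with RECFM=N and
builds each record in a retained buffer (buf, n and c are names invented for
this sketch):

```sas
/* One-pass sketch; rawfile.txt is a hypothetical name for the lab file. */
data work2(keep=id bases);
  /* RECFM=N treats the file as a byte stream with no record boundaries. */
  infile "&path.rawfile.txt" recfm=n eof=flush;
  length id $15 bases $2000 buf $2015 c $1;
  retain buf ' ' n 0;

  input c $char1.;                  /* next byte of the stream */
  if rank(c) <= 36 then return;     /* same characters that collate(0,36) covers */

  /* Assumption: a U can only be the first character of an identifier, */
  /* so seeing one means the buffered record is complete.              */
  if c = 'U' and n > 0 then do;
    id    = substr(buf, 1, 15);
    bases = substr(buf, 16);
    output;
    buf = ' ';
    n   = 0;
  end;

  n + 1;
  if n <= 2015 then substr(buf, n, 1) = c;   /* append; overflow is dropped */
  return;

flush:                              /* reached when INPUT runs off the end of the file */
  if n > 0 then do;
    id    = substr(buf, 1, 15);
    bases = substr(buf, 16);
    output;
  end;
  stop;
run;
```

Since the step contains explicit OUTPUT statements, an observation is written
only when a U closes the previous record or when the file ends, so there is no
intermediate delimited file to write and reread.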
Let me know if I didn't explain the data or the problems I am having with it.
TIA for any suggestions