LISTSERV at the University of Georgia
Date:   Thu, 18 Jul 2002 11:06:04 -0400
Reply-To:   "Delaney, Kevin P." <khd8@CDC.GOV>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Delaney, Kevin P." <khd8@CDC.GOV>
Subject:   Reading messy Genomic data
Content-Type:   text/plain; charset="iso-8859-1"

Let's see if I can describe my problem adequately...

I have received raw lab data (is there anything uglier :-)) from a group charged with sequencing about 1500 bases for us...

The data should be of the form: a 15-character identifier (e.g. UBA123456789123) followed by the 1500 or so bases.

Two issues: 1) the file they sent is one continuous record with no delimiters, but with several unprintable characters mixed in, in no obvious pattern; 2) the number of bases sequenced is not constant, so I can't say something like Input Id: $15. Sequence: $1500. @@;

What I have done is read in the whole file (luckily small, ~200KB), compress out the special characters, then reread the result one character at a time and write out a delimited file, looking for U's (luckily I have DNA sequences, so a U should only appear at the start of an identifier) and placing the delimiter before them...

Then I read my delimited file back in...

My current code looks something like this...

DATA work1; *(keep=fixseq flag);
  INFILE "&path.00122sequences.txt" truncover end=eof lrecl=1048576 length=varlength;
  input @1 sequence $2000. @;
  varlen=varlength;
  do i=0 to varlength by 2000;
    INPUT @1 +i sequence $2000. @;
    fixseq=compress(translate(sequence," ",collate(0,36)));
    *if indexc(trim(left(fixseq)),"U")>0 then flag=1;
    *else flag=0;
    output;
  end;
run;

filename junk "&path.delimitedfile.txt";

data _null_;
  set work1 end=eof;
  length a $1;
  file junk lrecl=300000;
  do i=1 by 1 until(a=" ");
    a=substr(fixseq,i,1);
    if a="U" then do;
      put ","a +(-1) @;
    end;
    else do;
      put a +(-1) @;
    end;
  end;
run;

data work2(keep=id bases);
  infile junk dsd dlm="," lrecl=1048576 length=varlength;
  length sequence $2000;
  input sequence: $2000. @@;
  totlen=length(sequence);
  id=substr(sequence,1,15);
  bases=substr(sequence,16);
run;

This works OK for this small file, but I am thinking there has to be a better solution (one pass? Maybe an array to hold values up to and after a 'U'BA id?).
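To make the one-pass idea concrete, here is a rough, untested sketch of the shape it might take. It assumes the file can be read as a byte stream with RECFM=N, that the junk characters are exactly the ones COLLATE(0,36) covers (ASCII ranks 0 through 36), that a U never occurs anywhere except as the first character of an identifier, and that no identifier plus its bases runs past 2015 characters. The names c, record, done and flush are made up for illustration, and CATS needs SAS 9 or later.

```sas
data work2(keep=id bases);
  /* RECFM=N reads the file as a stream of bytes, so stray control  */
  /* characters cannot split a record; EOF= names the label to jump */
  /* to when INPUT runs off the end of the file.                    */
  infile "&path.00122sequences.txt" recfm=n eof=done;
  length c $1 record $2015 id $15 bases $2000;
  retain record;

  input c $char1.;

  /* Skip the same characters that collate(0,36) selects. */
  if rank(c) <= 36 then return;

  /* A U marks the start of the next identifier: write out the */
  /* record collected so far before starting a new one.        */
  if c = 'U' and record ne '' then link flush;
  record = cats(record, c);
  return;

done:
  /* End of file: write out the last record, if there is one. */
  if record ne '' then link flush;
  stop;

flush:
  id    = substr(record, 1, 15);
  bases = substr(record, 16);
  output;
  record = '';
  return;
run;
```

Each pass of the data step reads one byte, so nothing is written to an intermediate file. Appending with CATS one character at a time is not fast, but for a file of about 200KB that should not matter.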

Let me know if I didn't explain the data or the problems I am having with it well enough...

TIA for any suggestions

Kevin Delaney
