Date: Thu, 18 Nov 1999 19:39:24 GMT
Reply-To: walter_smith@MY-DEJA.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: walter_smith@MY-DEJA.COM
Organization: Deja.com - Before you buy.
Subject: Re: Stepping through a sequence of data until a criteria is met
In article <19991117171749.4794.rocketmail@web116.yahoomail.com>,
Stephen Arthur <sarthur67@YAHOO.COM> wrote:
Stephen
I still haven't figured out how to paste into this posting window and
have the results properly formated so the following code will look
jumbled. However, at the lower left is a link "View original Usenet
format", this will display the code properly and you can copy and paste
into your own editor.
/*********************************************************************
* This may be more than you wanted but I had fun doing it. *
* *
* This approach has the advantage of being very flexible. It can *
* search through an an arbitrary amount of data. It can search *
* for dna segments from 1 to 200 elements long. The match target *
* is easily changed, as is the match criteria. Once it finds a *
* segment matching the criteria, it saves the dna segment that *
* matches and other information (the relative starting position *
* from the beginning of the sequence, the frequency count, the *
* match percentage, and it computes the number of elements since *
* the last match. These are printed in a report but could also *
* be saved to a permanent dataset for later analysis. *
* *
* Walt Smith November 18, 1999 *
**********************************************************************
* To customize functionality, change the following macro *
* variables: *
* seglen is the length of the dna segment to search for *
* machstr is the search target in a special format for the *
* "in" operator *
* machpct is the threshhold percentage of matches that defines *
* a "hit" *
* dnafile is the name of the file containing the dna data *
* logtrace turns on/off (Y/N) data tracing to the log *
* gendata turns on/off (Y/N) the generation of random dna data *
* set off (N) to read the real data *
* NOTE: gendata=Y will overlay the contents of the *
* &dnafile - for simulation, this is OK, for real *
* data it is not. *
*********************************************************************/
%let seglen=50;
%let machstr='C','G';
%let machpct=50;
%let dnafile=dna.tmp;
%let logtrace=Y;
%let gendata=Y;
*********************************************************************;
*********************************************************************;
/*********************************************************************
* This datastep was used to generate data for testing. The bases *
* A C G & T are selected randomly based on the probabilities pa, *
* pc, pg, and pt. Change these to see what happens. *
* *
* The j loop controls how many bases per data record. The i loop *
* controls how many records are written to the file. *
*********************************************************************/
%macro gendata;
%if &gendata=Y %then %do;
%*-----define the probabilities for A C G, p(T) is calculated ----;
/*%let problist=.25 .23 .23 0;*/
/*%let problist=.25 .28 .28 0;*/
%let problist=.25 .25 .25 0;
options ls=100 ps=max;
data _null_;
file "&dnafile";
array tmp {*} $1 b1-b4 ('A' 'C' 'G' 'T');
array prob pa pc pg pt (&problist);
pt=1-pa-pc-pg;
do i=1 to 800;
do j=1 to 20;
base=tmp{ rantbl(87614256, pa , pc , pg , pt ) };
put @(2*j-1) base @;
end;
put;
end;
run;
%end;
%mend;
%gendata;
/*********************************************************************
* This is the datastep that reads the data and searches for a *
* match. When it finds a segment that meets the criteria it *
* writes the data to the "found" dataset. Of course it could be *
* changed to suit, depending on what you want to do when you *
* find a match. *
*********************************************************************/
data found;
infile "&dnafile" flowover eof=Jumpout;
array tmp $1 b1-b4 ('A' 'C' 'G' 'T');
*--------------- arrays for counting matches, proportions;
array prob pa pc pg pt;
array count ka kc kg kt;
array ttlprob ppa ppc ppg ppt;
array ttlcnt kka kkc kkg kkt;
length base $1;
length dnaseg $&seglen;
lastpos=1;
*--------------- read the file into the string dnaseg;
link ReLoad;
*--------------- search the rest of the file;
do while( 1 );
*---------------------------------------------------------------;
* when match criteria are met, output the data, calc the ;
* step length, and reload the dnsseg ;
*---------------------------------------------------------------;
pct=k/&seglen *100;
if pct >= &machpct then do;
pos=inkk -&seglen +1;
step=pos-lastpos;
output;
do over count; count=0; end;
lastpos=pos;
link ReLoad;
end;
*---------------------------------------------------------------;
* otherwise shift the dnaseg one byte left, increment, ;
* decrement counters, and read the next datum ;
*---------------------------------------------------------------;
else do;
if substr(dnaseg,1,1) in (&machstr) then k=k-1;
do over count;
if substr(dnaseg,1,1) = tmp then count=count-1;
end;
substr(dnaseg,1,1)=' ';
dnaseg=left(dnaseg);
input base @;
inkk+1;
do over count;
if base=tmp then count+1;
prob=count/&seglen;
if base=tmp then ttlcnt+1;
ttlprob=ttlcnt/inkk;
end;
if base in (&machstr) then k+1;
substr(dnaseg,&seglen,1)=base;
if "&logtrace"="Y" then do;
pct=k/&seglen *100;
put pct= dnaseg=;
end;
end;
end;
drop base i lastpos b1-b4;
format pct 3.0;
format pa pc pg pt 4.2;
format ppa ppc ppg ppt 6.4;
return;
*---------------------------------------------------------------;
* This link routine reads the data into the dnaseg var ;
*---------------------------------------------------------------;
ReLoad: *;
k=0;
dnaseg=' ';
do i=1 to &seglen;
input base @;
inkk+1;
do over count;
if base=tmp then count+1;
prob=count/&seglen;
if base=tmp then ttlcnt+1;
ttlprob=ttlcnt/inkk;
end;
if base in (&machstr) then k+1;
substr(dnaseg,i,1)=base;
if "&logtrace"="Y" then do;
pct=k/i *100;
put pct= dnaseg=;
end;
end;
return;
Jumpout: *;
run;
/*********************************************************************
* Print a simple report *
*********************************************************************/
proc print data=found label;
options ls=133 ps=78;
%let title=DNA Segments With &machpct%nrbquote(%);
%let title=&title or More Occurances of &machstr;
title1 "&title";
title2 "Data From &dnafile, DNA Segment Length &seglen";
%*----------- this var stmt gives a basic rpt;
/*var pos step k pct dnaseg;*/
%*----------- this includes the segment proportions;
var pos step k pct pa pc pg pt dnaseg;
%*----------- this includes the segment counts;
/*var pos step k pct ka kc kg kt dnaseg;*/
%*----------- this includes the cummulative proportions;
/*var pos step k pct ppa ppc ppg ppt dnaseg;*/
label k='Number of Matches'
pct='Percent of Segment Matching'
pos='Starting Position in Sequence'
step='Distance From Last Match'
dnaseg='DNA Segment'
pa='p(A)'
pc='p(C)'
pg='p(G)'
pt='p(T)'
ppa='Cumm p(A)'
ppc='Cumm p(C)'
ppg='Cumm p(G)'
ppt='Cumm p(T)'
ka='Count A'
kc='Count C'
kg='Count G'
kt='Count T';
run;
----------------------------------------------------------------------
Walt Smith | All models are wrong
wjsmith1@fedex.com | Some models are useful
Sent via Deja.com http://www.deja.com/
Before you buy.
|