LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 1999, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 18 Nov 1999 19:39:24 GMT
Reply-To:     walter_smith@MY-DEJA.COM
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         walter_smith@MY-DEJA.COM
Organization: Deja.com - Before you buy.
Subject:      Re: Stepping through a sequence of data until a criteria is met

In article <19991117171749.4794.rocketmail@web116.yahoomail.com>, Stephen Arthur <sarthur67@YAHOO.COM> wrote:

Stephen

I still haven't figured out how to paste into this posting window and have the results properly formated so the following code will look jumbled. However, at the lower left is a link "View original Usenet format", this will display the code properly and you can copy and paste into your own editor.

/********************************************************************* * This may be more than you wanted but I had fun doing it. * * * * This approach has the advantage of being very flexible. It can * * search through an an arbitrary amount of data. It can search * * for dna segments from 1 to 200 elements long. The match target * * is easily changed, as is the match criteria. Once it finds a * * segment matching the criteria, it saves the dna segment that * * matches and other information (the relative starting position * * from the beginning of the sequence, the frequency count, the * * match percentage, and it computes the number of elements since * * the last match. These are printed in a report but could also * * be saved to a permanent dataset for later analysis. * * * * Walt Smith November 18, 1999 * ********************************************************************** * To customize functionality, change the following macro * * variables: * * seglen is the length of the dna segment to search for * * machstr is the search target in a special format for the * * "in" operator * * machpct is the threshhold percentage of matches that defines * * a "hit" * * dnafile is the name of the file containing the dna data * * logtrace turns on/off (Y/N) data tracing to the log * * gendata turns on/off (Y/N) the generation of random dna data * * set off (N) to read the real data * * NOTE: gendata=Y will overlay the contents of the * * &dnafile - for simulation, this is OK, for real * * data it is not. * *********************************************************************/ %let seglen=50; %let machstr='C','G'; %let machpct=50; %let dnafile=dna.tmp; %let logtrace=Y; %let gendata=Y; *********************************************************************; *********************************************************************;

/********************************************************************* * This datastep was used to generate data for testing. The bases * * A C G & T are selected randomly based on the probabilities pa, * * pc, pg, and pt. Change these to see what happens. * * * * The j loop controls how many bases per data record. The i loop * * controls how many records are written to the file. * *********************************************************************/ %macro gendata; %if &gendata=Y %then %do; %*-----define the probabilities for A C G, p(T) is calculated ----; /*%let problist=.25 .23 .23 0;*/ /*%let problist=.25 .28 .28 0;*/ %let problist=.25 .25 .25 0; options ls=100 ps=max; data _null_; file "&dnafile"; array tmp {*} $1 b1-b4 ('A' 'C' 'G' 'T'); array prob pa pc pg pt (&problist); pt=1-pa-pc-pg; do i=1 to 800; do j=1 to 20; base=tmp{ rantbl(87614256, pa , pc , pg , pt ) }; put @(2*j-1) base @; end; put; end; run; %end; %mend; %gendata;

/********************************************************************* * This is the datastep that reads the data and searches for a * * match. When it finds a segment that meets the criteria it * * writes the data to the "found" dataset. Of course it could be * * changed to suit, depending on what you want to do when you * * find a match. * *********************************************************************/ data found; infile "&dnafile" flowover eof=Jumpout; array tmp $1 b1-b4 ('A' 'C' 'G' 'T'); *--------------- arrays for counting matches, proportions; array prob pa pc pg pt; array count ka kc kg kt; array ttlprob ppa ppc ppg ppt; array ttlcnt kka kkc kkg kkt; length base $1; length dnaseg $&seglen; lastpos=1; *--------------- read the file into the string dnaseg; link ReLoad; *--------------- search the rest of the file; do while( 1 ); *---------------------------------------------------------------; * when match criteria are met, output the data, calc the ; * step length, and reload the dnsseg ; *---------------------------------------------------------------; pct=k/&seglen *100; if pct >= &machpct then do; pos=inkk -&seglen +1; step=pos-lastpos; output; do over count; count=0; end; lastpos=pos; link ReLoad; end; *---------------------------------------------------------------; * otherwise shift the dnaseg one byte left, increment, ; * decrement counters, and read the next datum ; *---------------------------------------------------------------; else do; if substr(dnaseg,1,1) in (&machstr) then k=k-1; do over count; if substr(dnaseg,1,1) = tmp then count=count-1; end; substr(dnaseg,1,1)=' '; dnaseg=left(dnaseg); input base @; inkk+1; do over count; if base=tmp then count+1; prob=count/&seglen; if base=tmp then ttlcnt+1; ttlprob=ttlcnt/inkk; end; if base in (&machstr) then k+1; substr(dnaseg,&seglen,1)=base; if "&logtrace"="Y" then do; pct=k/&seglen *100; put pct= dnaseg=; end; end; end; drop base i lastpos b1-b4; format pct 3.0; format pa pc pg pt 4.2; format ppa ppc ppg ppt 6.4; return; *---------------------------------------------------------------; * This link routine reads the data into the dnaseg var ; *---------------------------------------------------------------; ReLoad: *; k=0; dnaseg=' '; do i=1 to &seglen; input base @; inkk+1; do over count; if base=tmp then count+1; prob=count/&seglen; if base=tmp then ttlcnt+1; ttlprob=ttlcnt/inkk; end; if base in (&machstr) then k+1; substr(dnaseg,i,1)=base; if "&logtrace"="Y" then do; pct=k/i *100; put pct= dnaseg=; end; end; return; Jumpout: *; run;

/********************************************************************* * Print a simple report * *********************************************************************/ proc print data=found label; options ls=133 ps=78; %let title=DNA Segments With &machpct%nrbquote(%); %let title=&title or More Occurances of &machstr; title1 "&title"; title2 "Data From &dnafile, DNA Segment Length &seglen"; %*----------- this var stmt gives a basic rpt; /*var pos step k pct dnaseg;*/ %*----------- this includes the segment proportions; var pos step k pct pa pc pg pt dnaseg; %*----------- this includes the segment counts; /*var pos step k pct ka kc kg kt dnaseg;*/ %*----------- this includes the cummulative proportions; /*var pos step k pct ppa ppc ppg ppt dnaseg;*/ label k='Number of Matches' pct='Percent of Segment Matching' pos='Starting Position in Sequence' step='Distance From Last Match' dnaseg='DNA Segment' pa='p(A)' pc='p(C)' pg='p(G)' pt='p(T)' ppa='Cumm p(A)' ppc='Cumm p(C)' ppg='Cumm p(G)' ppt='Cumm p(T)' ka='Count A' kc='Count C' kg='Count G' kt='Count T'; run; ---------------------------------------------------------------------- Walt Smith | All models are wrong wjsmith1@fedex.com | Some models are useful

Sent via Deja.com http://www.deja.com/ Before you buy.


Back to: Top of message | Previous page | Main SAS-L page