|Date: ||Thu, 10 Mar 2011 14:19:05 -0800|
|Reply-To: ||"Sprague, Webb (OFM)" <Webb.Sprague@OFM.WA.GOV>|
|Sender: ||"SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>|
|From: ||"Sprague, Webb (OFM)" <Webb.Sprague@OFM.WA.GOV>|
|Subject: ||Re: "Unpacking" variable in a datastep into a "macro function"|
|Content-Type: ||text/plain; charset="us-ascii"|
I am not sure why you think I needed help with RE's (my question was
about macros), but for the sake of all our edification, I have replied
inline.
Note that these RE's are the last attempt to filter out a housing type
when other, more deterministic, approaches have already failed. So I am
shooting for fuzzy.
> %SysFunc( PRXMatch( /...../ , <VarName> ) )
> Only works in 9.2 and higher; in prior versions the PrxMatch function,
> when used in the macro facility, assumed the pattern was already
> precompiled via a PRXparse function call.
I couldn't get it to work, and decided to go with a simpler approach.
> I'd lose the * and replace it with a +; I'm pretty sure the 'E' and 'O'
No, the * is what I want, because often when people abbreviate they
leave out the vowels. RMDL => Remodel. REEMODEL => Remodel. But
RAMODIL probably not Remodel.
> in RE*MO*DE*L for example aren't optional.
> Rather the poster is wanting one or more occurrences of these.
> If it is optional then I would use the ?.
> The * says 0 or more, which more often than not is not exactly what
> the coder wants. In short the * is one
> of the most overused and misunderstood quantifiers.
Misunderstood by some people, but not by me. ;)
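To make the quantifier disagreement concrete, here is a minimal sketch in Python rather than SAS. It assumes that Python's `re` module treats `*`, `+` and `?` the same way the Perl-style PRX functions do for these simple patterns. The `+` and `?` variants and the names `STAR`, `PLUS`, `OPTIONAL` and `SAMPLES` are illustrative and not from the original post.

```python
import re

# Pattern from the thread: '*' allows zero or more of the preceding vowel.
STAR = re.compile(r"RE*MO*DE*L")
# Suggested alternative: '+' demands at least one of each vowel.
PLUS = re.compile(r"RE+MO+DE+L")
# Other alternative: '?' allows zero or one of each vowel.
OPTIONAL = re.compile(r"RE?MO?DE?L")

# The three abbreviations mentioned in the reply, plus the full word.
SAMPLES = ["RMDL", "REMODEL", "REEMODEL", "RAMODIL"]


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


# Map each sample to (star, plus, optional) match results.
results = {
    s: (matches(STAR, s), matches(PLUS, s), matches(OPTIONAL, s))
    for s in SAMPLES
}

for s, (star, plus, opt) in results.items():
    print(f"{s:10} *={star!s:5} +={plus!s:5} ?={opt!s:5}")
```

Under that assumption, only the `*` form accepts both RMDL and REEMODEL while still rejecting RAMODIL, which is the fuzzy behavior described above.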
> Depending on how many alternatives are to be searched for, I would be
> tempted to lose the capturing parens or use
> non-capturing parens, and if there is to be a preference which one is
> to be matched first I would look at the order
> the alternatives are specified in the pattern.
Why lose the parens? I thought you needed them with alternation "|". If
not, yes they should go, as I am not substituting them in. Putting most
frequent first is a good idea.
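On the parens question, a small sketch may help, again in Python's `re` on the assumption that alternation and grouping behave as they do in the Perl-style PRX engine. The input strings and variable names are hypothetical. It shows that top-level alternation needs no parentheses, that `(?:...)` groups without capturing, and that alternatives are tried left to right.

```python
import re

# Top-level alternation needs no parentheses at all.
flat = re.compile(r"SFR|SINGLEFAM|SINGFAM")

# Grouping only matters when the alternation sits inside a larger pattern.
# (?:...) groups without capturing; (...) would group and capture.
non_capturing = re.compile(r"NON(?:REMODEL|SFR)")
ungrouped = re.compile(r"NONREMODEL|SFR")  # reads as NONREMODEL, or SFR alone

# Alternatives are tried left to right at each starting position.
short_first = re.compile(r"SING|SINGLEFAM")
long_first = re.compile(r"SINGLEFAM|SING")

flat_hit = flat.search("2STORYSINGFAM").group(0)
group_count = non_capturing.groups  # number of capturing groups
grouped_on_sfr = non_capturing.search("SFR") is not None
ungrouped_on_sfr = ungrouped.search("SFR") is not None
short_hit = short_first.search("SINGLEFAM").group(0)
long_hit = long_first.search("SINGLEFAM").group(0)

print(flat_hit, group_count, grouped_on_sfr, ungrouped_on_sfr)
print(short_hit, long_hit)
```

For a plain yes/no match the order of alternatives does not change the answer, only which alternative is tried first, so putting the most frequent spelling first is a speed consideration rather than a correctness one.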
> If distinct words are to be matched and not parts of a word I would
> use the \b word boundary metasequences.
Yeah, I know how that works too. I compress, delete punctuation, and
thus \b doesn't apply.
> As it is now it could match REMODEL, NONREMODEL, REMODELED.... you
> get the idea.
That's what we want. I am glad you confirm. ;) NONREMODELBUTREALLYSFR
would fall through the cracks, but it is unlikely.
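The compress-then-match approach can be sketched as follows. This is an illustration in Python, not the SAS code actually in use: `squash` is a hypothetical stand-in for the compress and punctuation-removal step, and the input strings are made up. It shows why `\b` stops being useful once spaces and punctuation are gone, and why an unanchored pattern still finds REMODEL inside a longer run of letters.

```python
import re
import string

# Unanchored pattern from the thread, and a word-bounded variant for contrast.
REMODEL = re.compile(r"RE*MO*DE*L")
REMODEL_WORD = re.compile(r"\bRE*MO*DE*L\b")


def squash(text):
    """Upper-case the text and drop all whitespace and punctuation."""
    drop = string.whitespace + string.punctuation
    return text.upper().translate(str.maketrans("", "", drop))


raw = "Non-remodel, but really SFR"
squashed = squash(raw)

# Before squashing, the hyphen and space give \b something to anchor on.
word_hit_raw = REMODEL_WORD.search(raw.upper()) is not None
# After squashing, every neighbor is a letter, so \b never matches.
word_hit_squashed = REMODEL_WORD.search(squashed) is not None
# The unanchored pattern still finds REMODEL inside the squashed string.
plain_hit_squashed = REMODEL.search(squashed) is not None

print(squashed, word_hit_raw, word_hit_squashed, plain_hit_squashed)
```

The last result is the NONREMODELBUTREALLYSFR case: the unanchored pattern flags it as a remodel even though the text says otherwise.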
> If there is only one word and not a bunch of words in the Target
> I would add the ^ and $ line anchors to allow
> the Reg Ex engine optimizer to take over.
Again, I am looking for the words anywhere in the string, so I don't use
anchors (except once).
> I would definitely add the /o pattern modifier to the RegEx pattern.
I will look into that.
> Finally, it may be faster to break the alternatives down into a series
> of PrxMatch function calls rather than one big
> honker pattern and function call. This, however, takes knowing one's
> data, what is to and not to be matched.
Actually, I know my data fairly well, ... thanks. I keep the REs as big
honkers because lots of different input (SFR, SINGLEFAM, SINGFAM, etc =>
SFR) all yield the same result, and I like the way it is organized this
way. Like you say, the most common in the bunch near the front would be