| Date: | Thu, 10 Mar 2011 14:19:05 -0800 |
| Reply-To: | "Sprague, Webb (OFM)" <Webb.Sprague@OFM.WA.GOV> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Sprague, Webb (OFM)" <Webb.Sprague@OFM.WA.GOV> |
| Subject: | Re: "Unpacking" variable in a datastep into a "macro function" |
|
| In-Reply-To: | A<BLU152-w3022409EA1EF0525C12DB6DEC80@phx.gbl> |
| Content-Type: | text/plain; charset="us-ascii" |
I am not sure why you think I needed help with RE's (my question was
about macros) but for the sake of all our edification, I have replied
below.
Note that these RE's are the last attempt to filter out a housing type
when other, more deterministic, approaches have already failed. So I am
shooting for fuzzy.
> %SysFunc( PRXMatch( /...../ , <VarName> ) )
>
> Only works in 9.2 and higher; in prior versions the PrxMatch function,
> when used in the macro facility, assumed the pattern was already
> precompiled via a PRXparse function call.
I couldn't get it to work, and decided to go with a simpler approach.
>
> /(RE*MO*DE*L)|(A*LTE*RA*T[IO]*N*)|(ADDI*T[IO]*)|(REHAB)|(REROOF)|(REMOVE)/
>
> I'd lose the * and replace it with a +, I'm pretty sure the 'E' and 'O'
No, the * is what I want, because often when people abbreviate they
leave out the vowels. RMDL => Remodel. REEMODEL => Remodel. But
RAMODIL probably not Remodel.
> in RE*MO*DE*L for example aren't optional.
> Rather the poster is wanting at least one or more occurrences of these.
> If it is optional then I would use the ?.
> The * says 0 or more, which more often than not is not exactly what the
> coder wants. In short the * is one
> of the most overused and misunderstood quantifiers.
Misunderstood by some people, but not by me. ;)
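To make the quantifier difference concrete, here is a minimal sketch. It is written in Python rather than SAS, on the assumption that Python's re module treats *, + and ? the same way as the Perl-style syntax PRXMATCH accepts; the helper name matches() is made up for illustration.

```python
import re

STAR = re.compile(r"RE*MO*DE*L")  # each vowel may be missing or repeated
PLUS = re.compile(r"RE+MO+DE+L")  # each vowel must appear at least once
OPT = re.compile(r"RE?MO?DE?L")   # each vowel appears zero or one time


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    # Prints one row per word: STAR, PLUS and OPT results in that order.
    for word in ["RMDL", "REEMODEL", "RAMODIL"]:
        print(word, matches(STAR, word), matches(PLUS, word), matches(OPT, word))
```

Only the * form accepts both the vowel-less RMDL and the doubled-vowel REEMODEL, and none of the three accepts RAMODIL, since the A is not one of the letters the pattern allows in that position.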
> Depending on how many alternatives are to be searched for, I would be
> tempted to lose the capturing parens or use
> non-capturing parens, and if there is to be a preference which one is
> to be matched first I would look at the order
> the alternatives are specified in the pattern.
Why lose the parens? I thought you needed them with alternation "|". If
not, yes they should go, as I am not substituting them in. Putting the most
frequent first is a good idea.
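On the parens question, a small sketch in Python (assumed here to follow the same alternation rules as Perl-style patterns): top-level "|" works without any parentheses, and (?:...) groups without capturing when only part of the pattern should alternate. The variable names are illustrative only.

```python
import re

GROUPED = re.compile(r"(REHAB)|(REROOF)|(REMOVE)")  # three capturing groups
BARE = re.compile(r"REHAB|REROOF|REMOVE")           # same alternatives, no groups
FACTORED = re.compile(r"RE(?:HAB|ROOF|MOVE)")       # shared prefix, non-capturing

SAMPLES = ["REHAB", "NEWREROOF", "REMOVE", "GARAGE"]


def found(pattern, text):
    """Return True if the pattern occurs anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for s in SAMPLES:
        print(s, found(GROUPED, s), found(BARE, s), found(FACTORED, s))
    # Number of capturing groups each compiled pattern carries.
    print(GROUPED.groups, BARE.groups, FACTORED.groups)
```

All three forms accept and reject the same sample strings; they differ only in how many capture groups the engine has to track.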
> If distinct words are to be matched and not parts of a word I would add
> the \b word boundary metasequences.
Yeah, I know how that works too. I compress, delete punctuation, and
thus \b doesn't apply.
> As it is now it could match REMODEL, NONREMODEL, REMODELED.... you get
> the idea.
That's what we want. I am glad you confirm it. ;) NONREMODELBUTREALLYSFR
would fall through the cracks, but it is unlikely.
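A rough sketch of why \b has nothing to hold on to once the string is compressed, again in Python as a stand-in for the PRX functions. The normalize() step (uppercase, then keep only letters and digits) is an assumption about the cleanup, not the actual data step code, and both function names are hypothetical.

```python
import re

# The pattern quoted earlier in this thread, joined back onto one line.
FUZZY = re.compile(
    r"(RE*MO*DE*L)|(A*LTE*RA*T[IO]*N*)|(ADDI*T[IO]*)|(REHAB)|(REROOF)|(REMOVE)"
)
WORD_BOUNDED = re.compile(r"\bREMODEL\b")


def normalize(text):
    """Uppercase, then drop everything except letters and digits (assumed cleanup)."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def is_remodel(text):
    """Return True if the fuzzy pattern occurs anywhere in the normalized text."""
    return FUZZY.search(normalize(text)) is not None


if __name__ == "__main__":
    for raw in ["Kitchen re-model", "NONREMODEL", "remodeled", "New SFR"]:
        print(repr(raw), normalize(raw), is_remodel(raw))
```

After normalization "Kitchen re-model" becomes KITCHENREMODEL, so a \bREMODEL\b pattern no longer finds a boundary before the R, while the unanchored pattern still matches.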
> If there is only one word and not a bunch of words in the Target String
> I would add the ^ and $ line anchors to allow
> the Reg Ex engine optimizer to take over.
Again, I am looking for the words anywhere in the string, so I don't use
anchors (except once).
> I'd definitely add the /o pattern modifier to the RegEx pattern.
I will look into that.
> Finally, it may be faster to break the alternatives down into a series
> of PrxMatch function calls rather than one big
> honker pattern and function call. This however, takes knowing one's
> data, what is to and not to be matched.
Actually, I know my data fairly well, ... thanks. I keep the REs as big
honkers because lots of different inputs (SFR, SINGLEFAM, SINGFAM, etc. =>
SFR) all yield the same result, and I like the way it is organized this
way. As you say, putting the most common in the bunch near the front would
be best.
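One way to picture the "one big pattern per result" layout, as a hedged Python sketch: the RULES table, its labels, and the classify() helper are all hypothetical, and the patterns are cut-down stand-ins built from the examples in this message rather than the real ones.

```python
import re

# Hypothetical rule table: (label, pattern), most common category first.
RULES = [
    ("SFR", re.compile(r"SFR|SINGLEFAM|SINGFAM")),
    ("REMODEL", re.compile(r"RE*MO*DE*L|REHAB|REROOF|REMOVE")),
]


def classify(text):
    """Return the label of the first rule whose pattern occurs in text, else None."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return None


if __name__ == "__main__":
    for s in ["SINGFAMDWELLING", "NEWSFR", "RMDLKITCHEN", "GARAGE"]:
        print(s, classify(s))
```

Keeping every spelling of one category inside a single pattern means each label appears once in the table, and the order of the table decides which category wins when more than one pattern could match.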