Date: Wed, 16 Mar 2005 13:52:01 -0800
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: What is the regular expression for SAS format name?
In-Reply-To: <200503161907.j2GJ7Zen016234@listserv.cc.uga.edu>
Content-type: text/plain; charset=US-ASCII
"Chang Y. Chung" <chang_y_chung@hotmail.com> replied to Mike:
> Not that I am a regular head,
How about a(n) RX-head? We already have SQL-heads around here.
> but another name for the "non-capturing
> parentheses" is "Grouping-only parentheses."
Or think of it as 'clustering'.
It is one of several Perl regex extensions that start with "(?",
many of which will work in the PRX() functions.
> It has three character opening
> sequence and one ending right parenthesis, i.e. (?: ... ), where ...
> represents your pattern to be grouped but not "captured." According to
> Friedl (2002, "the Owl book" by O'Reilly), it is useful because:
We hipsters refer to Jeff Friedl's book as the 'hip owls' book,
as is more obvious if you look at the cover illustration.
> (1) helps building up a regex from parts;
> (2) cleaner since the reader doesn't need to wonder if what's matched
> by what they group is accessed elsewhere by $1, $2, ...
> (3) can be more efficient (Friedl 2002: 136)
>
> But he says on other page:
>
> "On the other hand, the (?: ... ) notation is somewhat unsightly, and
> perhaps makes the expression more difficult to grasp at a glance. Are
> the benefits worth it? Well, personally, I tend to use exactly the
> kind of
> parentheses I need ...." (Friedl 2002:45)
>
> I think it appeared in Perl 5 and said to be the invention of Larry
> himself.
All the (?...) features appeared in Perl 5, along with a vast ton of
other stuff.
My personal feeling is that the (?:matchingstuff) functionality works
best when you do need to capture other components of the string, and you
don't want to have to figure out which things are keepers and which you
want to throw back.
In particular, the CALL PRXNEXT() routine is convenient in SAS when you
have a sequence {1,2,3,...} of buffers that you are going to call in
order using a do loop. If you have undesired segments, say an
alternation, you can use the (?:foo) clustering so this would NOT be one
of the captured buffers and you would NOT have to fiddle with the
iterator of the do loop to get the right captured buffer information.
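Here is a minimal sketch of that numbering point. It is written in
Python rather than SAS only because Python's re module accepts the same
(?:...) syntax and is easy to run; the sample string, the title
alternation, and the variable names are all made up for illustration.

```python
import re

# Hypothetical input: a title, a first name, and a last name.
text = "Dr. Grace Hopper"

# Capturing alternation: the title takes buffer 1, so the names land in 2 and 3.
capturing = re.match(r"(Dr\.|Mr\.|Ms\.) (\w+) (\w+)", text)

# Clustering alternation: (?:...) groups without capturing,
# so the names land in buffers 1 and 2.
clustered = re.match(r"(?:Dr\.|Mr\.|Ms\.) (\w+) (\w+)", text)

capturing_groups = capturing.groups()
clustered_groups = clustered.groups()

# Walk the buffers in order, much as a do loop would in SAS.
names = [clustered.group(i) for i in range(1, clustered.re.groups + 1)]

print(capturing_groups)
print(clustered_groups)
print(names)
```

With the clustered version the loop can start at buffer 1 and keep
everything it finds; with the capturing version it would have to skip
the title buffer.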
> There are also other constructs like (?<name> ... ), which is not
> available in perl, and (?> ... ) (called atomic grouping -- I don't
> know what this is!)
Assume your pattern to be matched is 'foo'. The constructs you *can*
use in the PRX() functions are the following. These are called
'zero-width assertions' because they don't actually use up any
characters in the text string you're parsing.
(?=foo) zero-width positive lookahead
(?!foo) zero-width negative lookahead
(?<=foo) zero-width positive lookbehind
(?<!foo) zero-width negative lookbehind
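To see the four assertions side by side, here is a small sketch in
Python, whose re module uses the same syntax for them. The test string
and variable names are invented for the example, and I am assuming the
PRX() functions behave the same way on patterns this simple.

```python
import re

# Hypothetical test string with 'foo' in three different contexts.
text = "foobar barfoo foo"

# (?=bar): match 'foo' only when 'bar' follows; 'bar' is not consumed.
ahead = [m.span() for m in re.finditer(r"foo(?=bar)", text)]

# (?!bar): match 'foo' only when 'bar' does NOT follow.
not_ahead = [m.span() for m in re.finditer(r"foo(?!bar)", text)]

# (?<=bar): match 'foo' only when 'bar' sits immediately before it.
behind = [m.span() for m in re.finditer(r"(?<=bar)foo", text)]

# (?<!bar): match 'foo' only when 'bar' does NOT sit immediately before it.
not_behind = [m.span() for m in re.finditer(r"(?<!bar)foo", text)]

print(ahead, not_ahead, behind, not_behind)
```

Every span is exactly three characters long: the assertion decides
whether 'foo' matches, but contributes nothing to the matched text.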
The '?' was chosen as the modifying character for these because it was
already being used as a modifier on the quantifiers *, +, ?, and {n,m}.
As we already know, these quantifiers are 'greedy'. Matching starts at
the leftmost possible position and then sucks up as much as possible.
This means that we get problems like:
string = 'The cat, the rat, and lovell our dog all rule england under the hog.'
(Yes, I took a small literary license here. So sue me.)
Then the pattern /c.*t/ will match... not "cat" but:
string = 'The cat, the rat, and lovell our dog all rule england under the hog.'
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yep, everything from the first 'c' to the last 't'. So we can use the
'?' modifier to make these non-greedy, matching as little as possible
while still making the rest of the pattern work. So the pattern /c.*?t/
would in fact match only 'cat' here.
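If you want to check the greedy and non-greedy behavior yourself, here
is a quick sketch in Python, which treats .* and .*? the same way Perl
does. It is a stand-in for the PRX() functions, not SAS code, and the
variable names are mine.

```python
import re

string = "The cat, the rat, and lovell our dog all rule england under the hog."

# Greedy: .* grabs as much as it can, then backs off to the last 't'.
greedy = re.search(r"c.*t", string).group()

# Non-greedy: .*? grabs as little as it can, stopping at the first 't'.
lazy = re.search(r"c.*?t", string).group()

print(repr(greedy))
print(repr(lazy))
```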
If you want to see what you can do with these zero-width assertions,
look up the pattern match I did a few months ago in SAS-L, using one of
them. It's an example where you HAVE to have a zero-width assertion to
get you out of a bind.
If you know who the Cat, the Rat, Lovell our dog, and the Hog are, give
yourself three brownie points. Ten if you're American.
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician