LISTSERV at the University of Georgia
Date:         Wed, 16 Mar 2005 13:52:01 -0800
Reply-To:     cassell.david@EPAMAIL.EPA.GOV
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject:      Re: What is the regular expression for SAS format name?
In-Reply-To:  <200503161907.j2GJ7Zen016234@listserv.cc.uga.edu>
Content-type: text/plain; charset=US-ASCII

"Chang Y. Chung" <chang_y_chung@hotmail.com> replied to Mike:
> Not that I am a regular head,

How about a(n) RX-head? We already have SQL-heads around here.

> but another name for the "non-capturing
> parentheses" is "Grouping-only parentheses."

Or think of it as 'clustering'.

It is one of several Perl regex extensions that start with "(?", many of which will work in the PRX() functions.

> It has three character opening
> sequence and one ending right parenthesis, i.e. (?: ... ), where ...
> represents your pattern to be grouped but not "captured." According to
> Friedl (2002, "the Owl book" by O'Reilly), it is useful because:

We hipsters refer to Jeff Friedl's book as the 'hip owls' book, as is more obvious if you look at the cover illustration.

> (1) helps building up a regex from parts;
> (2) cleaner since the reader doesn't need to wonder if what's matched by
> what they group is accessed elsewhere by $1, $2, ...
> (3) can be more efficient (Friedl 2002: 136)
>
> But he says on another page:
>
> "On the other hand, the (?: ... ) notation is somewhat unsightly, and
> perhaps makes the expression more difficult to grasp at a glance. Are the
> benefits worth it? Well, personally, I tend to use exactly the kind of
> parentheses I need ...." (Friedl 2002:45)
>
> I think it appeared in Perl 5 and is said to be the invention of Larry himself.

All the (?...) features appeared in Perl 5, along with a vast ton of other stuff.

My personal feeling is that the (?:matchingstuff) functionality works best when you do need to capture other components of the string, and you don't want to have to figure out which things are keepers and which you want to throw back.

In particular, the CALL PRXNEXT() routine is convenient in SAS when you have a sequence {1,2,3,...} of buffers that you are going to call in order using a do loop. If you have undesired segments, say an alternation, you can use the (?:foo) clustering so this would NOT be one of the captured buffers and you would NOT have to fiddle with the iterator of the do loop to get the right captured buffer information.
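Here is a minimal sketch of that buffer-numbering point. It is written with Python's re module rather than SAS, purely as a stand-in that accepts the same (?:...) syntax; the sample text, the patterns, and the variable names are all made up for illustration.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "cat is fat"

# Plain parentheses around the alternation: it becomes capture buffer 2,
# so the word we actually care about is pushed out to buffer 3.
capturing = re.compile(r"(\w+) (is|was) (\w+)")

# (?:...) clusters the alternation without capturing it, so the wanted
# pieces stay in buffers 1 and 2 with no gap to step over in a loop.
clustering = re.compile(r"(\w+) (?:is|was) (\w+)")

with_capture = capturing.search(text).groups()
without_capture = clustering.search(text).groups()

print(with_capture)     # three buffers, the middle one unwanted
print(without_capture)  # two buffers, both wanted
```

With the clustering version, a loop over buffers 1 through N sees only the keepers, which is the convenience described above.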

> There are also other constructs like (?<name> ... ), which is not available
> in perl, and (?> ... ) (called atomic grouping -- I don't know what this is!)

Assume your pattern to be matched is 'foo'. The constructs you *can* use in the PRX() functions are the following. These are called 'zero-width assertions' because they don't actually use up any characters in the text string you're parsing.

(?=foo)    zero-width positive lookahead
(?!foo)    zero-width negative lookahead
(?<=foo)   zero-width positive lookbehind
(?<!foo)   zero-width negative lookbehind
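To make the four assertions concrete, here is a small sketch in Python's re module, used only as a stand-in because it accepts the same four constructs. The sample text and the choice of 'bar' as the thing being matched are assumptions for illustration. Note that every match span covers only 'bar': the 'foo' inside the assertion is tested but never consumed.

```python
import re

# Hypothetical sample text; 'bar' occurs at offsets 3, 7 and 14.
text = "foobar barfoo bar"

def spans(pattern):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# 'bar' only where 'foo' comes right after it.
pos_ahead = spans(r"bar(?=foo)")
# 'bar' only where 'foo' does NOT come right after it.
neg_ahead = spans(r"bar(?!foo)")
# 'bar' only where 'foo' comes right before it.
pos_behind = spans(r"(?<=foo)bar")
# 'bar' only where 'foo' does NOT come right before it.
neg_behind = spans(r"(?<!foo)bar")

for name, result in [("(?=foo)", pos_ahead), ("(?!foo)", neg_ahead),
                     ("(?<=foo)", pos_behind), ("(?<!foo)", neg_behind)]:
    print(name, result)
```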

The '?' was chosen as the modifying character for these because it was already being used as a modifier on the quantifiers *, +, ?, and {n,m}. As we already know, these quantifiers are 'greedy': matching starts at the leftmost possible position and then sucks up as much as possible. This means that we get problems like:

string = 'The cat, the rat, and lovell our dog all rule england under the hog.'

(Yes, I took a small literary license here. So sue me.)

Then the pattern /c.*t/ will match... not "cat" but:

string = 'The cat, the rat, and lovell our dog all rule england under the hog.'
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yep, everything from the first 'c' to the last 't'. So we can use the '?' modifier to make these non-greedy, matching as little as possible while still making the rest of the pattern work. So the pattern /c.*?t/ would in fact match only 'cat' here.
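If you want to check the greedy and non-greedy behavior yourself, here is a quick sketch using Python's re module as a stand-in (the quantifier syntax is the same Perl-style one). The variable names are mine; the string and both patterns are the ones from this message.

```python
import re

string = "The cat, the rat, and lovell our dog all rule england under the hog."

# Greedy: .* grabs as much as it can, then backs off only as far as the
# last 't' in the string.
greedy = re.search(r"c.*t", string).group()

# Non-greedy: .*? grabs as little as it can, stopping at the first 't'
# that lets the rest of the pattern succeed.
lazy = re.search(r"c.*?t", string).group()

print(repr(greedy))
print(repr(lazy))
```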

If you want to see what you can do with these zero-width assertions, look up the pattern match I did a few months ago in SAS-L, using one of them. It's an example where you HAVE to have a zero-width assertion to get you out of a bind.

If you know who the Cat, the Rat, Lovell our dog, and the Hog are, give yourself three brownie points. Ten if you're American.

David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician

