LISTSERV at the University of Georgia
Date:         Wed, 17 Jan 2007 14:31:24 -0500
Reply-To:     Ken Borowiak <EvilPettingZoo97@AOL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Ken Borowiak <EvilPettingZoo97@AOL.COM>
Subject:      Re: Extracting word(s) occurring in text before a certain keyword

On Mon, 15 Jan 2007 06:01:38 -0500, hakanener99@YAHOO.COM wrote:

>Hello,
>
>Thanks for the tips so far! In response to Ken's suggestion, I'm posting a
>sample observation, which is typical of what I have (the person's name was
>disguised for anonymity).
>
>The objective is to extract the company names that precede the following
>keywords: "Inc." "Corp." "Co." "Corporation" "Pharmaceuticals" (this last
>keyword may also be mentioned with reference to the industry itself, rather
>than a specific company. The only way to distinguish is whether the
>preceding words are capitalized and in the same sentence). So, the result
>would be a horizontal array of cells that list (for the following
>observation): Celltech, ALZA, SEQUUS.
>
>You may notice that there is another company in the observation
>(Hoffman-La Roche) which is not followed by any of those keywords. This is a
>complication in the dataset. It is only possible to extract this company's
>name and add it to the results with the help of a reference list, such as a
>directory of all available company names to use as a reference. For now,
>imagine I had such a directory and that I am only interested in firms listed
>in that directory (I can create a directory through major business databases
>such as CompuSTAT and extract the identifying words in the same way that I
>want to do here, so that I have a huge list that mentions companies in many
>industries, such as "Coca Cola" "Pepsi" "Hoffman-La Roche" and so on. I'd
>extract that list without the Inc. or Corp. type keywords so that I can use
>it to match any company names mentioned in my original dataset where a
>company may be mentioned without Inc. and Corp.)
>
>I was not able to see how to choose among or combine suggestions that
>mentioned PRX functions with Index and Substring type data processing,
>especially when a master list of company names must also be consulted (as
>in the case of Hoffman-La Roche) to complete the task.
>
>Hakan
>France

Hakan,

Thanks for posting a sample observation.

/*- Sample Observation -*/
data foo ;
  length desc $4000 ;
  desc="Joe Anonymous Dingaling has been the Chief Executive Officer and President of CellGate Inc. since July 2002. He joined CellGate in September 2001 as Executive Vice President. Prior to that, he served as Vice President of Clinical Development at ALZA Corporation, a specialty pharmaceutical company. Prior to his role at ALZA, he served as Senior Vice President and Medical Director at SEQUUS Pharmaceuticals, where he was responsible for clinical and regulatory functions as a member of the Executive Committee. Between 1983 and 1996, he held various positions at Hoffmann-La Roche,including Vice President of Clinical Operations, Virology. He has been the Director of CellGate Inc. since July 2002. He is a member of the American Society of Clinical Oncology and the American Academy of Pharmaceutical Physicians, among others. He holds a B.S. degree from Hobart College and an M.D. from the Georgetown University School of Medicine in Washington, D.C. He trained in medicine at the Long Island Jewish-Hillside Medical Center and completed fellowships in hematology at New York University and in medical oncology at Memorial Sloan- Kettering Cancer Center in New York";
run ;

/*-- Match all cases, will need to transpose --*/
data bar ;
  _re=prxparse('/(?:\b[A-Z][\w-]+\s*)+(?=Inc|Co(?:rp|mp)|Pharma)/') ;
  if missing(_re) then stop ;
  length company $50 ;

  do until( eof ) ;
    set foo end=eof ;
    start = 1;
    stop = length( desc );
    call prxnext(_re, start, stop, desc, position, length);
    do while (position > 0);
      company = substr(desc, position, length) ;
      output ;
      call prxnext(_re, start, stop, desc, position, length) ;
    end;
  end ;
  stop ;
run;

The regular expression looks for a series of 'words' that each begin with a capital letter and are separated by zero or more whitespace characters, followed by 'Inc', 'Corp', 'Comp' or 'Pharma'. Non-capturing buffers (?:) are used to conserve memory, and a positive lookahead (?=) requires the keyword to follow while keeping 'Inc', etc. out of the matched text.
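To see how the pattern behaves before running it on the full data, here is a minimal sketch that tries it on two short strings. The test strings and the variable names txt, hit, pos and len are made up for illustration and are not part of the data above. The first string should return the capitalized words in front of 'Corp', and the second should return no match because the pattern is case-sensitive and nothing capitalized precedes the keyword.

```sas
/* Quick check of the pattern on hypothetical test strings */
data _null_ ;
  length txt $80 hit $50 ;
  _re = prxparse('/(?:\b[A-Z][\w-]+\s*)+(?=Inc|Co(?:rp|mp)|Pharma)/') ;
  if missing(_re) then stop ;

  do txt = 'joined Acme Widget Corp. in 1999',
           'the pharmaceutical industry' ;
    /* PRXSUBSTR returns the position and length of the first match */
    call prxsubstr(_re, txt, pos, len) ;
    if pos > 0 then hit = substr(txt, pos, len) ;
    else hit = '(no match)' ;
    put txt= hit= ;
  end ;
run ;
```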

This should get you close to your desired output.
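For the transpose step mentioned in the comment above, one possible sketch follows. It assumes the data set bar holds one row per match for a single input observation, as with the sample. The names bar_unique and wide are made up for illustration. With more than one observation in foo you would need an identifying variable and a BY statement, which the sample data does not have.

```sas
/* Drop repeated matches (CellGate appears twice in the sample) */
proc sort data=bar(keep=company) out=bar_unique nodupkey ;
  by company ;
run ;

/* One row, with the matches in company1, company2, ... */
proc transpose data=bar_unique out=wide(drop=_name_) prefix=company ;
  var company ;
run ;
```

Sorting reorders the names alphabetically. If the order of appearance matters, skip the PROC SORT step and transpose bar directly.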

HTH, Ken

