Date: Wed, 17 Jan 2007 14:31:24 -0500
Reply-To: Ken Borowiak <EvilPettingZoo97@AOL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ken Borowiak <EvilPettingZoo97@AOL.COM>
Subject: Re: Extracting word(s) occurring in text before a certain keyword
On Mon, 15 Jan 2007 06:01:38 -0500, hakanener99@YAHOO.COM wrote:
>Hello,
>
> Thanks for the tips so far! In response to Ken's suggestion, I'm posting a
>sample observation, which is typical of what I have (the person's name was
>disguised for anonymity).
>
> The objective is to extract the company names that precede the following
>keywords: "Inc." "Corp." "Co." "Corporation" "Pharmaceuticals" (this last
>keyword may also be mentioned with reference to the industry itself, rather
>than a specific company. The only way to distinguish is whether the
>preceding words are capitalized and in the same sentence). So, the result
>would be a horizontal array of cells that list (for the following
>observation): Celltech, ALZA, SEQUUS.
>
> You may notice that there is another company in the observation
>(Hoffmann-La Roche) which is not followed by any of those keywords. This is
>a complication in the dataset. It is only possible to extract this
>company's name and add it to the results with the help of a reference list,
>such as a directory of all available company names. For now, imagine I had
>such a directory and that I am only interested in firms listed in that
>directory (I can create a directory through major business databases such
>as CompuSTAT and extract the identifying words in the same way that I want
>to do here, so that I have a huge list that mentions companies in many
>industries, such as "Coca Cola" "Pepsi" "Hoffmann-La Roche" and so on. I'd
>extract that list without the Inc. or Corp. type keywords so that I can use
>it to match any company names mentioned in my original dataset where a
>company may be mentioned without Inc. and Corp.)
>
> I was not able to see how to choose among or combine suggestions that
>mentioned PRX functions with Index and Substring type data processing,
>especially when a master list of company names must also be consulted (as
>in the case of Hoffmann-La Roche) to complete the task.
>
> Hakan
>France
>
Hakan,
Thanks for posting a sample observation.
/*- Sample Observation -*/
data foo ;
length desc $4000 ;
desc="Joe Anonymous Dingaling has been the Chief Executive
Officer and President of CellGate Inc. since July
2002. He joined CellGate in September 2001
as Executive Vice President. Prior to that, he
served as Vice President of Clinical
Development at ALZA Corporation, a specialty
pharmaceutical company. Prior to his role at ALZA, he
served as Senior Vice President and Medical
Director at SEQUUS Pharmaceuticals, where he was
responsible for clinical and regulatory functions as a
member of the Executive Committee. Between 1983 and
1996, he held various positions at
Hoffmann-La Roche,including Vice President of Clinical
Operations, Virology. He has been the Director of
CellGate Inc. since July 2002. He is a member of the
American Society of Clinical Oncology and the American
Academy of Pharmaceutical Physicians, among others.
He holds a B.S. degree from Hobart College
and an M.D. from the Georgetown University School of
Medicine in Washington, D.C. He trained in medicine at
the Long Island Jewish-Hillside Medical Center and
completed fellowships in hematology at New York
University and in medical oncology at Memorial Sloan-
Kettering Cancer Center in New York";
run ;
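/*-- Match all cases, will need to transpose --*/
data bar ;
   /* Compile the pattern once; the DO UNTIL loop below reads every row */
   _re=prxparse('/(?:\b[A-Z][\w-]+\s*)+(?=Inc|Co(?:rp|mp)|Pharma)/') ;
   if missing(_re) then stop ;   /* pattern failed to compile */
   length company $50 ;
   do until( eof ) ;
      set foo end=eof ;
      /* Search the whole string, starting at the first character */
      start = 1;
      stop = length( desc );
      call prxnext(_re, start, stop, desc, position, length);
      /* One output row per match; PRXNEXT advances START past each match */
      do while (position > 0);
         company = substr(desc, position, length) ;
         output ;
         call prxnext(_re, start, stop, desc, position, length) ;
      end;
   end ;
   stop ;   /* everything was read inside the loop, so end the step here */
run;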
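The regular expression looks for a run of one or more 'words' that each
begin with a capital letter and are separated by zero or more whitespace
characters, followed by 'Inc', 'Corp', 'Comp' or 'Pharma'. Because only the
start of the keyword is tested, 'Corporation' and 'Pharmaceuticals' qualify
as well; 'Co.' on its own does not. Non-capturing groups (?:) are used
because nothing needs to be captured into a buffer, and a positive lookahead
(?=) keeps 'Inc', etc. out of the matched text.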
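If you want to see the lookahead at work before running it on the full
data, a quick check on a short string may help. This is only a sketch: the
test sentence and the variable names (re, text, pos, len, match) are made up
for illustration and are not part of the code above.

```sas
/* Hypothetical one-off check; the test sentence is invented */
data _null_ ;
   re = prxparse('/(?:\b[A-Z][\w-]+\s*)+(?=Inc|Co(?:rp|mp)|Pharma)/') ;
   text = 'He worked at ALZA Corporation before that.' ;
   /* PRXSUBSTR returns the position and length of the first match */
   call prxsubstr(re, text, pos, len) ;
   if pos > 0 then do ;
      match = substr(text, pos, len) ;
      /* Write the result to the log for inspection */
      put pos= len= match= ;
   end ;
   else put 'No match found' ;
run ;
```

The matched text should stop just before 'Corporation', since the lookahead
only tests for the keyword and does not consume it. The match can carry a
trailing blank from the \s*, which is harmless in a padded character
variable.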
This should get you close to your desired output.
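To go from one row per match to the horizontal layout you asked for,
something along these lines may work. It is a sketch that assumes the single
sample observation; the data set names bar_unique and wide are my own
placeholders. With many observations you would need an identifying variable
on FOO (there is none in the sample) carried into BAR and used in a BY
statement in both steps.

```sas
/* Drop repeated mentions (CellGate is matched twice in the sample) */
proc sort data=bar(keep=company) out=bar_unique nodupkey ;
   by company ;
run ;

/* Turn the rows into one row with columns company1, company2, ... */
proc transpose data=bar_unique out=wide(drop=_name_) prefix=company ;
   var company ;
run ;
```

Note that the sort puts the names in alphabetical order rather than order of
appearance. If the original order matters, you could skip the sort and
remove duplicates some other way.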
HTH,
Ken