Date: Mon, 15 Jan 2007 22:06:40 +0100
Reply-To: Martin Gregory <gregorym@T-ONLINE.DE>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Martin Gregory <gregorym@T-ONLINE.DE>
Subject: Re: detect dots as end of sentence
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
More general would be to use the regular expression that Emacs uses for
end of sentence:
[.?!][]\"')]*($| $|\t|  )[ \t\n]*
and use this instead of the \. in Alan's suggestion.
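A minimal sketch of how this might look in a DATA step, assuming the SOURCE data set (with x and id) from the original post. It uses a simplified variant of the sentence-end pattern: closing punctuation, optional closing quotes or brackets, then whitespace. The data set name sentences and the variables re, start, stop, pos, len and prev are my own choices for illustration. The piece after the last match is written out separately rather than relying on $.

```sas
data sentences (keep = id sentence);
  length sentence $5000;
  set source;
  retain re;
  /* Compile once: ., ? or !, optional closing quote or bracket, then whitespace */
  if _n_ = 1 then re = prxparse('/[.?!][\]"'')]*\s+/');
  start = 1;
  stop  = length(x);   /* last non-blank position of x */
  prev  = 1;           /* where the current sentence begins */
  call prxnext(re, start, stop, x, pos, len);
  do while (pos > 0);
    /* Text from prev up to the end of the match, trailing blanks removed */
    sentence = strip(substr(x, prev, pos + len - prev));
    output;
    prev = pos + len;
    call prxnext(re, start, stop, x, pos, len);
  end;
  /* Whatever follows the last match is the final sentence */
  sentence = strip(substr(x, prev));
  if sentence ne '' then output;
run;
```

With the example string this should give two observations per input row, with 4.31 and 12.5 left intact, since neither of those dots is followed by whitespace.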
Alan Churchill wrote:
> I would use regular expressions here.
> Split the text using the following regex:
> That should give you what you need.
> Alan Churchill
> Savian "Bridging SAS and Microsoft Technologies"
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Arjen
> Sent: Monday, January 15, 2007 9:04 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: detect dots as end of sentence
> Hi SAS-L,
> Please look at the tested code below. I am trying to split sentences in
> character strings by detecting the dot (.) at the end of a sentence. I
> encounter difficulties because there are numbers with dots in them. I
> figure two solutions:
> (i) Replace all numbers with numbers written down European-style: 4,31
> g 12,5 years - I have no clue
> (ii) Split sentences by searching for a dot and a space (. ); I tried
> to include a space in the code, but then I get a dataset with all
> separate words sorted out, which is not what I need.
> Any suggestions? Thanks.
> data SOURCE;
> x = "Daily intake of less than 4.31 g in people younger than 12.5 did
> not cause any harmful effects. I would highly recommend this drug.";
> data SOURCE; set SOURCE; id+1; run;
> data need (drop = i);
> length y $5000;
> set source;
> do i = 1 to 100 while(scan(x,i,".") ne "");
> y = scan(x,i,".")||'.';