| Date: | Mon, 15 Jan 2007 22:06:40 +0100 |
| Reply-To: | Martin Gregory <gregorym@T-ONLINE.DE> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Martin Gregory <gregorym@T-ONLINE.DE> |
| Organization: | T-Online |
| Subject: | Re: detect dots as end of sentence |
| In-Reply-To: | <007601c738c2$253db380$6fb91a80$@net> |
| Content-Type: | text/plain; charset=ISO-8859-1; format=flowed |
A more general approach would be to use the regular expression that Emacs
uses for the end of a sentence:
[.?!][]\"')]*($| $|\t| )[ \t\n]*
and substitute it for the \. in Alan's suggestion.
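
A minimal, untested sketch of how that might look in a DATA step, assuming SAS 9 or later (for the PRX functions) and Arjen's SOURCE data set with the variables id and x. The pattern is a simplified variant of the Emacs one: sentence punctuation, optional closing quotes or brackets, then whitespace or the end of the string. The names re, start, stop, pos, len and prev are only illustrative.

```sas
data need (keep = id y);
   length y $5000;
   set source;
   retain re;
   /* Compile once: [.?!], optional closers ] " ' ), then whitespace or end. */
   if _n_ = 1 then re = prxparse('/[.?!][\]"'')]*(\s+|$)/');
   start = 1;            /* where the next search begins             */
   stop  = length(x);    /* last non-blank character of x            */
   prev  = 1;            /* first character of the current sentence  */
   call prxnext(re, start, stop, x, pos, len);
   do while (pos > 0);
      /* Sentence runs from prev through the end of the match. */
      y = strip(substr(x, prev, pos + len - prev));
      output;
      prev = pos + len;
      call prxnext(re, start, stop, x, pos, len);
   end;
   /* Any remaining text that was not closed by a matched sentence end. */
   if prev <= length(x) then do;
      y = strip(substr(x, prev));
      output;
   end;
run;
```

Because the dot in a number such as 4.31 or 12.5 is followed by a digit rather than by whitespace, the pattern should not treat it as a sentence end. For Arjen's example string this is intended to give one observation per sentence. Abbreviations followed by a space (e.g. "approx. 5 g") would still be split and need separate handling.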
-Martin
Alan Churchill wrote:
> Arjen,
>
> I would use regular expressions here.
>
> Split the text using the following regex:
>
> (?<=[a-z])\.
>
> That should give you what you need.
>
> Alan
>
> Alan Churchill
> Savian "Bridging SAS and Microsoft Technologies"
> www.savian.net
>
>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Arjen
> Sent: Monday, January 15, 2007 9:04 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: detect dots as end of sentence
>
> Hi SAS-L,
>
> Please look at the tested code below. I am trying to split sentences in
> character strings by detecting the dot (.) at the end of a sentence. I
> encounter difficulties because there are numbers with dots in them. I
> figure two solutions:
> (i) Replace all numbers with numbers written down European-style: 4,31
> g 12,5 years - I have no clue
> (ii) Split sentences by searching for a dot and a space (. ); I tried
> to include a space in the code, but then I get a dataset with all
> separate words sorted out, which is not what I need.
>
> Any suggestions? Thanks.
>
> Arjen
>
> data SOURCE;
> x = "Daily intake of less than 4.31 g in people younger than 12.5 did
> not cause any harmful effects. I would highly recommend this drug."
> ;
> run;
>
> data SOURCE; set SOURCE; id+1; run;
>
> data need (drop = i);
> length y $5000;
> set source;
> do i = 1 to 100 while(scan(x,i,".") ne "");
> y = scan(x,i,".")||'.';
> output;
> end;
> run;