Date: Thu, 8 Jan 2004 13:57:17 +0100
Reply-To: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Subject: Re: Foreign Languages
Content-Type: text/plain; charset="iso-8859-1"
Hi David,
20 years ago I developed an automatic hyphenation program for the Dutch language, programmed in Fortran-4 (or 66) and called KAPAF. It was based on fuzzy logic (probability calculations) and a matrix of (estimated) hyphenation chances between letter pairs. Its percentage of correct hyphenations was higher than 95%, and I had ideas for improving this figure. Without going into much detail, the chosen hyphenation point in a word was between the letter pair with the largest breaking probability. I made some experimental adaptations of the Dutch hyphenation matrix for the English language. It also appeared that letter pairs occur with different frequency distributions in the two languages, and I already had justified visions of letting the program itself determine the language in use.
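To make the breaking rule concrete, here is a minimal sketch in Python (not the original Fortran, and not KAPAF itself). The function names, the `min_side` limit and the probabilities in `TOY_PROBS` are assumptions made up for illustration only; a real matrix would have to be estimated from hyphenated text.

```python
def best_break(word, break_prob, min_side=2):
    """Return index i so that the word splits as word[:i] + '-' + word[i:].

    The split is placed between the letter pair with the largest breaking
    probability. min_side (an assumption of this sketch) keeps at least
    that many letters on each side. Returns None if no pair has a
    probability above zero.
    """
    letters = word.lower()
    best_i, best_p = None, 0.0
    for i in range(min_side, len(letters) - min_side + 1):
        pair = letters[i - 1] + letters[i]   # the pair straddling position i
        p = break_prob.get(pair, 0.0)        # unknown pairs never break
        if p > best_p:
            best_i, best_p = i, p
    return best_i


def hyphenate(word, break_prob):
    """Insert one hyphen at the most probable break, or return the word as is."""
    i = best_break(word, break_prob)
    return word if i is None else word[:i] + "-" + word[i:]


# Illustrative values only, NOT taken from any real hyphenation matrix.
TOY_PROBS = {"db": 0.9, "nd": 0.2, "oo": 0.05}

print(hyphenate("handbook", TOY_PROBS))
```

With these toy values the largest probability belongs to the pair "db", so the word is split between "d" and "b".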
But in order to be able to do so, quite a lot of text has to be analyzed. Such a method is not feasible with just a few single words. But if only one language has been used in all your comments, you should be able to apply such a strategy and statistically compare an English (or other) language distribution of letter pairs to that of your data. You could even disregard accented letters by reducing them to unaccented ones.
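A rough sketch of that comparison, again in Python and under stated assumptions: the function names are hypothetical, accents are reduced via Unicode decomposition, and cosine similarity is simply one possible choice of statistic, not necessarily the best one. The reference distribution for each language would have to be built from a sizeable sample of known text.

```python
import unicodedata
from collections import Counter
from math import sqrt


def strip_accents(text):
    """Reduce accented letters to unaccented ones (e.g. 'é' -> 'e').

    Letters without a Unicode decomposition, such as 'æ', are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pair_distribution(text):
    """Relative frequency of adjacent letter pairs within words."""
    counts = Counter()
    for word in strip_accents(text).lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Similarity of two pair distributions: 1 = same shape, 0 = no shared pairs."""
    dot = sum(v * q.get(pair, 0.0) for pair, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

The idea would be to compute `pair_distribution` once for a large English reference text and once for all comments taken together, and to treat a low similarity as a hint that the comments are not in English. What counts as "low" has to be calibrated on known data.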
Well, this could develop into a large project ...
Regards - Jim.
--
. . . . . . . . . . . . . . . .
Jim Groeneveld, MSc.
Biostatistician
Science Team
Vitatron B.V.
Meander 1051
6825 MJ Arnhem
Tel: +31/0 26 376 7365
Fax: +31/0 26 376 7305
Jim.Groeneveld@Vitatron.com
www.vitatron.com
My computer has the solutions, I have the problems.
[common disclaimer]
-----Original Message-----
From: David Jackson [mailto:david.jackson@EUROPE.PPDI.COM]
Sent: Wednesday, January 07, 2004 17:53
To: SAS-L@LISTSERV.UGA.EDU
Subject: Foreign Languages
SAS-L
I'm expecting delivery of a data set that will contain a "Comments"
column.
My task is to search the comments and pick out any text that has been
written in a "foreign" language (not English).
My (very long) solution involves checking each field to see if it
contains any one of the following characters using the index() function.
... ... ... ÀÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü ... ... ... (a subset of all
foreign characters)
Any ideas that might improve this?
Thanks
Dave