Date: Thu, 8 Jan 2004 13:57:17 +0100
Reply-To: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Subject: Re: Foreign Languages
Content-Type: text/plain; charset="iso-8859-1"
20 years ago I developed an automatic hyphenation program for the Dutch language, programmed in Fortran-4 (or 66) and called KAPAF. It was based on fuzzy logic (probability calculations) and a matrix of (estimated) hyphenation chances between letter pairs. Its percentage of correct hyphenations was higher than 95%, and I had ideas for improving this figure. Without going into much detail, the chosen hyphenation point in a word was between the letter pair with the largest breaking probability. I made some experimental adaptations of the Dutch hyphenation matrix for the English language. It also appeared that the occurrence of letter pairs was distributed differently in the two languages, and I already had justified visions of letting the program itself determine the language in use.
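To make the idea of "break at the letter pair with the largest probability" concrete, here is a minimal sketch in Python (not the original Fortran, and not KAPAF itself). The probability values, the default value and the function names are all made up for illustration; a real matrix would have to be estimated from hyphenated text.

```python
# Hypothetical break probabilities between letter pairs (illustrative only,
# NOT the actual KAPAF matrix). Key: (left letter, right letter).
BREAK_PROB = {
    ("n", "t"): 0.80,
    ("r", "k"): 0.70,
    ("l", "d"): 0.60,
}
DEFAULT_PROB = 0.05  # assumed chance for any pair not listed above


def best_break(word, probs=BREAK_PROB, default=DEFAULT_PROB):
    """Return index i so that the word splits as word[:i] + '-' + word[i:],
    choosing the letter pair with the largest break probability.
    Returns None for words shorter than two letters."""
    letters = word.lower()
    if len(letters) < 2:
        return None
    best_i, best_p = None, -1.0
    for i in range(1, len(letters)):
        p = probs.get((letters[i - 1], letters[i]), default)
        if p > best_p:  # first pair wins on ties
            best_i, best_p = i, p
    return best_i


def hyphenate_once(word):
    """Insert a single hyphen at the most probable break point."""
    i = best_break(word)
    return word if i is None else word[:i] + "-" + word[i:]


print(hyphenate_once("winter"))
```

With the made-up values above, the only listed pair in "winter" is n/t, so the break falls between those two letters. A fuller version would presumably also weigh neighbouring letters and minimum fragment lengths, which this sketch ignores.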
But to be able to do so, a fair amount of text has to be analyzed. Such a method is not feasible with just a few single words. But if only one language has been used in all your comments, you should be able to apply such a strategy and statistically compare an English (or other) language distribution of letter pairs to that of your data. You could even disregard accented letters by reducing them to unaccented ones.
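A rough sketch of that comparison, written in Python rather than SAS purely for illustration. It reduces accented letters to unaccented ones, builds a relative frequency table of letter pairs, and picks the reference language whose table is most similar. The function names, the choice of cosine similarity as the measure, and the idea of passing reference texts in a dictionary are my assumptions here, not an established method; any reasonable statistical distance could be used instead.

```python
import math
import unicodedata
from collections import Counter


def strip_accents(text):
    """Reduce accented letters to unaccented ones (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bigram_distribution(text):
    """Relative frequencies of adjacent letter pairs within words."""
    counts = Counter()
    for word in strip_accents(text).lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(zip(letters, letters[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Similarity between two sparse frequency tables (0 = nothing shared)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def closest_language(sample, references):
    """Name of the reference text whose letter-pair table best matches sample.
    `references` maps a language name to a (preferably large) example text."""
    sample_dist = bigram_distribution(sample)
    return max(
        references,
        key=lambda lang: cosine_similarity(
            sample_dist, bigram_distribution(references[lang])
        ),
    )
```

As said, the result is only meaningful when both the reference texts and the sample are reasonably long; on a few single words the letter-pair table is far too sparse to trust.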
Well, this could develop into a large project ...
Regards - Jim.
. . . . . . . . . . . . . . . .
Jim Groeneveld, MSc.
6825 MJ Arnhem
Tel: +31/0 26 376 7365
Fax: +31/0 26 376 7305
My computer has the solutions, I have the problems.
From: David Jackson [mailto:david.jackson@EUROPE.PPDI.COM]
Sent: Wednesday, January 07, 2004 17:53
Subject: Foreign Languages
I'm expecting delivery of a data set that will contain a "Comments"
My task is to search the comments and pick out any text that has been
written in a "foreign" language (not English).
My (very long) solution involves checking each field to see if it
contains any one of the following characters using the index() function.
... ... ... ÀÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü ... ... ... (a subset of all
Any ideas that might improve this