LISTSERV at the University of Georgia
Date:         Thu, 8 Jan 2004 13:57:17 +0100
Reply-To:     "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Groeneveld, Jim" <jim.groeneveld@VITATRON.COM>
Subject:      Re: Foreign Languages
Comments: To: David Jackson <david.jackson@EUROPE.PPDI.COM>
Content-Type: text/plain; charset="iso-8859-1"

Hi David,

20 years ago I developed an automatic hyphenation program for the Dutch language, programmed in Fortran-4 (or 66) and called KAPAF. It was based on fuzzy logic (probability calculations) and a matrix of (estimated) hyphenation chances between letter pairs. Its percentage of correct hyphenations was higher than 95%, and I had ideas for improving this figure. Without going into much detail, the chosen hyphenation point in a word was between the letter pair with the largest breaking probability. I made some experimental adaptations of the Dutch hyphenation matrix for the English language. It also appeared that the occurrence of letter pairs was distributed differently in the two languages, and I already had well-founded ideas about letting the program itself determine the language in use.
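As a rough illustration of the rule described above (break between the letter pair with the largest breaking probability), here is a minimal Python sketch. It is not KAPAF: the probability table, the default value, and the function names are all hypothetical, and the numbers are invented purely for the example.

```python
# Hypothetical break probabilities between adjacent letters (a, b):
# the estimated chance that a hyphen may fall between a and b.
# These values are made up for illustration, not taken from KAPAF.
BREAK_PROB = {
    ("n", "t"): 0.80,
    ("r", "d"): 0.70,
    ("e", "r"): 0.10,
}
DEFAULT_PROB = 0.05  # assumed value for pairs missing from the table


def best_hyphenation_point(word, probs=BREAK_PROB, default=DEFAULT_PROB):
    """Return the index i at which to split word into word[:i] and word[i:].

    The split is placed between the adjacent letter pair with the largest
    breaking probability. Returns None for words shorter than two letters.
    """
    lower = word.lower()
    if len(lower) < 2:
        return None
    best_i, best_p = None, -1.0
    for i in range(1, len(lower)):
        p = probs.get((lower[i - 1], lower[i]), default)
        if p > best_p:  # ties keep the leftmost candidate
            best_i, best_p = i, p
    return best_i


def hyphenate(word):
    """Insert a single hyphen at the most probable break point."""
    i = best_hyphenation_point(word)
    return word if i is None else word[:i] + "-" + word[i:]


print(hyphenate("winter"))
```

A real table would hold an estimate for every letter pair, and a practical hyphenator would also have to reject breaks that leave a fragment too short to stand alone.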

But to do so, a fair amount of text has to be analyzed; such a method is not feasible with just a few single words. If only one language has been used in all your comments, however, you should be able to apply such a strategy and statistically compare an English (or other) language distribution of letter pairs to that of your data. You could even disregard accented letters by reducing them to unaccented ones.
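The comparison suggested above could be sketched roughly as follows in Python. This is only one possible reading of the idea: the function names are hypothetical, the distance measure (total variation distance) is my own choice rather than anything from the original program, and a usable reference distribution would have to be estimated from a large body of text in each candidate language.

```python
import unicodedata
from collections import Counter


def fold_accents(text):
    """Reduce accented letters to unaccented ones (e.g. 'é' becomes 'e').

    Letters without a canonical decomposition, such as 'æ', are left as-is.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def bigram_distribution(text):
    """Relative frequencies of adjacent letter pairs within words."""
    folded = fold_accents(text).lower()
    counts = Counter()
    for word in folded.split():
        # Keep only a-z, so digits, punctuation and leftover
        # non-ASCII letters are ignored.
        letters = "".join(c for c in word if "a" <= c <= "z")
        for a, b in zip(letters, letters[1:]):
            counts[a + b] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def distance(p, q):
    """Total variation distance: 0.0 for identical, 1.0 for disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def closest_language(text, references):
    """Name of the reference distribution nearest to that of the text."""
    observed = bigram_distribution(text)
    return min(references, key=lambda name: distance(observed, references[name]))
```

Here `references` would be a dictionary such as `{"english": bigram_distribution(english_corpus), "dutch": bigram_distribution(dutch_corpus)}`, with both corpora supplied by the user. The result is only meaningful when the text being classified is reasonably long.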

Well, this could develop into a large project ...

Regards - Jim.
--
Jim Groeneveld, MSc.
Biostatistician
Science Team
Vitatron B.V.
Meander 1051
6825 MJ Arnhem
Tel: +31/0 26 376 7365
Fax: +31/0 26 376 7305
Jim.Groeneveld@Vitatron.com
www.vitatron.com

My computer has the solutions, I have the problems.

[common disclaimer]

-----Original Message-----
From: David Jackson [mailto:david.jackson@EUROPE.PPDI.COM]
Sent: Wednesday, January 07, 2004 17:53
To: SAS-L@LISTSERV.UGA.EDU
Subject: Foreign Languages

SAS-L

I'm expecting delivery of a data set that will contain a "Comments" column.

My task is to search the comments and pick out any text that has been written in a "foreign" language (not English).

My (very long) solution involves checking each field to see if it contains any one of the following characters using the index() function.

... ... ... ÀÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü ... ... ... (a subset of all foreign characters)
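For comparison, the character check described above could be written compactly as a single set-membership test. The Python sketch below is illustrative only: the names are hypothetical, and the character set contains just the characters visible in this message, since the full list was elided.

```python
# Only the characters visible in the post; the original list was longer
# and is elided there, so this set is knowingly incomplete.
FLAG_CHARS = frozenset("ÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü")


def has_flagged_char(comment, chars=FLAG_CHARS):
    """True if the comment contains at least one of the flagged characters."""
    return any(c in chars for c in comment)


comments = ["No problems reported", "Patient déjà vu", "Niño sano"]
flagged = [c for c in comments if has_flagged_char(c)]
print(flagged)
```

A check like this flags only text that happens to contain one of the listed characters, so foreign-language comments written entirely in unaccented letters would pass through unnoticed.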

Any ideas that might improve this?

Thanks

Dave


