Date: Thu, 7 Jul 2005 16:59:32 -0700
Reply-To: DavidL Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: DavidL Cassell <davidlcassell@MSN.COM>
Subject: Re: fuzzy string search
In-Reply-To: <MC5-F37JkObUjglz1zC000a3ec4@mc5-f37.hotmail.com>
Content-Type: text/plain; format=flowed
IHolas@PPV.ORG wrote:
>How can I program SAS 9.1 to perform search on strings allowing some
>variations at the ends of it.. ?
>
>I have entries such as:
>
>food
>fastfood
>fast food
>etc,
>
>I want to code all of them as food.
>
>Is indexw in data step the way to go, or is there a better way to do it?
INDEXW() *might* be the way to go. It rather depends on what else
you want to do with your entries.
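
A quick sketch of the difference, using the three entries from the question. The dataset name RECODED and the variables ENTRY, CATEGORY, WORD_POS, ANY_POS and REGEX_POS are made up for illustration. Note that INDEXW() only matches FOOD as a separate word, so it returns 0 for 'fastfood', while INDEX() and PRXMATCH() find it anywhere in the string:

```sas
data recoded;
   length entry $20 category $10;
   input entry $char20.;

   /* INDEXW() finds FOOD only as a whole, blank-delimited word */
   word_pos  = indexw(lowcase(entry), 'food');

   /* INDEX() finds FOOD anywhere, including inside FASTFOOD */
   any_pos   = index(lowcase(entry), 'food');

   /* PRXMATCH() does the same with a case-insensitive regex */
   regex_pos = prxmatch('/food/i', entry);

   /* recode every entry that contains FOOD somewhere */
   if any_pos > 0 then category = 'food';
   datalines;
food
fastfood
fast food
Fast Food
;
run;

proc print data=recoded;
run;
```

Whether you want the strict word match or the looser substring match depends on whether entries like 'foodstuff' should also be coded as food.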
>SECOND: Does SAS allow for a true fuzzy string search e.g. recognizing
>"resturant" as "restaurant"?
SAS gives you a ton of options here. You can do true NFA pattern matching
with the RX... and PRX... functions. You can do Levenshtein edit distances
with COMPLEV() or generalized edit distances with COMPGED(). You can even
tweak the features of your 'fuzziness' with COMPGED() by using the CALL
COMPCOST() routine to alter the underlying scoring system. Then there's
SPEDIS(), which computes a simpler 'spelling distance', and good old SOUNDEX()
as well. So you can make your searching as 'loose' or as 'tight' as you
want.
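
As a rough sketch of how you might compare these measures on the example from the question (the variable names TYPED, TARGET, LEV, GED and SPED are just placeholders, and CALL COMPCOST() is left out so that COMPGED() uses its default costs):

```sas
data _null_;
   typed  = 'resturant';
   target = 'restaurant';

   /* Levenshtein edit distance: number of single-character edits */
   lev  = complev(typed, target);

   /* generalized edit distance: weighted cost of the edits */
   ged  = compged(typed, target);

   /* spelling distance: asymmetric, so argument order matters */
   sped = spedis(typed, target);

   /* write the three scores to the log */
   put lev= ged= sped=;
run;
```

Smaller scores mean closer strings, so one way to do a 'fuzzy' match is to accept any pair whose score falls under a cutoff you pick by looking at your own data.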
BTW,
soundex('restaurant') = soundex('resturant')
because SOUNDEX() essentially *ignores* all vowel groups, unless they're
the first letter of the word. SOUNDEX() was designed to link English-origin
names, so its utility is highly dependent on your list of task words.
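
If you want to see what SOUNDEX() is doing with your own words, a small sketch like this writes the codes to the log (WORD and CODE are illustrative names). The two spellings of restaurant should get the same code, while food and fastfood should not, since they differ in consonants and not just in vowels:

```sas
data _null_;
   length word $12;

   /* loop over a few test words and print each SOUNDEX code */
   do word = 'restaurant', 'resturant', 'food', 'fastfood';
      code = soundex(word);
      put word= code=;
   end;
run;
```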
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330