LISTSERV at the University of Georgia
Date:         Thu, 7 Jul 2005 16:59:32 -0700
Reply-To:     DavidL Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         DavidL Cassell <davidlcassell@MSN.COM>
Subject:      Re: fuzzy string search
In-Reply-To:  <MC5-F37JkObUjglz1zC000a3ec4@mc5-f37.hotmail.com>
Content-Type: text/plain; format=flowed

IHolas@PPV.ORG wrote:
>How can I program SAS 9.1 to perform search on strings allowing some
>variations at the ends of it.. ?
>
>I have entries such as:
>
>food
>fastfood
>fast food
>etc,
>
>I want to code all of them as food.
>
>Is indexw in data step the way to go, or is there a better way to do it?

INDEXW() *might* be the way to go. It rather depends on what else you want to do with your entries.
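To see why the answer is only "might", here is a rough sketch in Python (not SAS) of a whole-word test versus a plain substring test. The helper names are made up for illustration, and `has_word` only approximates what INDEXW() does with its default blank delimiter; it is not a reimplementation of the SAS function.

```python
# Sample entries taken from the question above.
entries = ["food", "fastfood", "fast food"]

def has_word(text, word):
    """Whole-word test, roughly like INDEXW() with a blank delimiter (assumption)."""
    return word in text.lower().split()

def has_substring(text, word):
    """Plain substring test, roughly like INDEX()."""
    return word in text.lower()

word_hits = [e for e in entries if has_word(e, "food")]
substring_hits = [e for e in entries if has_substring(e, "food")]

print(word_hits)       # entries where "food" stands alone as a word
print(substring_hits)  # entries containing "food" anywhere
```

The whole-word test misses "fastfood", while the substring test catches all three entries but would also catch unrelated words that happen to contain "food". Which behaviour is right depends on the data.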

>SECOND: Does SAS allow for a true fuzzy string search e.g. recognizing
>"resturant" as "restaurant"?

SAS gives you a ton of options here. You can do true NFA (regular-expression) pattern matching with the RX... and PRX... functions. You can do Levenshtein edit distances with COMPLEV() or generalized edit distances with COMPGED(). You can even tweak the features of your 'fuzziness' with COMPGED() by using the CALL COMPCOST() routine to alter the underlying scoring system. Then there's SPEDIS(), which computes a simpler 'spelling distance', and good old SOUNDEX() as well. So you can make your searching as 'loose' or as 'tight' as you want.
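For anyone unfamiliar with edit distances, here is a minimal Python sketch (not SAS) of the textbook Levenshtein distance, the quantity COMPLEV() is described as computing. The `closest` helper, the `max_distance` cutoff, and the small vocabulary are hypothetical and exist only to show how a distance becomes a 'loose' or 'tight' match.

```python
def levenshtein(a, b):
    """Textbook edit distance: insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest(term, vocabulary, max_distance=2):
    """Return the nearest vocabulary word, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda w: levenshtein(term, w))
    return best if levenshtein(term, best) <= max_distance else None

# Hypothetical list of canonical category names.
vocabulary = ["restaurant", "food", "grocery"]

print(levenshtein("resturant", "restaurant"))
print(closest("resturant", vocabulary))
```

Raising `max_distance` loosens the match and lowering it tightens it. A generalized edit distance works the same way but lets each kind of edit carry its own cost.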

BTW, soundex('restaurant') = soundex('resturant') because SOUNDEX() essentially *ignores* all vowel groups, unless they're the first letter of the word. SOUNDEX() was designed to link English-origin names, so its utility is highly dependent on your list of task words.
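The vowel-dropping idea can be sketched in Python (not SAS) as below. This is a simplified Soundex-style code written for illustration: the function name is made up, it does not pad or truncate the result, and SAS's SOUNDEX() may differ in details.

```python
# Consonant classes used by Soundex-style codes; vowels, h, w, y get no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def simple_soundex(word):
    """Keep the first letter, then code consonants and skip uncoded letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:   # drop repeats of the same adjacent code
            result += code
        prev = code
    return result

print(simple_soundex("restaurant") == simple_soundex("resturant"))
```

Because the extra "a" in "restaurant" carries no code, both spellings reduce to the same consonant skeleton. The same property makes the code blind to genuinely different words that share consonants, which is why its usefulness depends on the word list.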

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
