LISTSERV at the University of Georgia
Date:         Thu, 25 Jan 2001 14:48:27 -0800
Reply-To:     "Huang, Ya" <ya.huang@AGOURON.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Huang, Ya" <ya.huang@AGOURON.COM>
Subject:      Re: nasty text processing puzzle: SAS or Perl?
Comments: To: VStevens <lilybear.nospam@BELLSOUTH.NET>
Content-Type: multipart/alternative;

Hi,

Here is a solution, it might be slow though:

** testing data set;
data xx;
  length comment $ 200;
  comment='xx yyy john smith zzz kkkk'; output;
  comment='www jjjj alan ppp'; output;
run;

** name list data set which has the names;
** need to be removed;
data nlist;
  length short $ 10;
  short='alan'; output;
  short='john'; output;
  short='smith'; output;
run;

** use sql to create a macro variable, which will repeatedly;
** call tranwrd() function to replace names with a blank;

proc sql;
  select "comment=tranwrd(comment,'"||compress(short)||"','');"
    into : repl separated by ' '
    from nlist
  ;
quit;

data xx;
  set xx;
  &repl;
run;

options nocenter;
proc print;
run;

--------------------------------
The SAS System        12:52 Thursday, January 25, 2001  52

Obs    comment

 1     xx yyy zzz kkkk
 2     www jjjj ppp

It is logically and syntactically simpler. The drawback is its efficiency: for each observation, 99% of the tranwrd() calls will find nothing to replace, so they just waste time. But since you have only 800 names, I guess it is still doable.
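
One caveat with the tranwrd() approach: it replaces substrings, so a name like 'alan' would also be cut out of 'balance'. If whole-word matching and flagging (rather than blanking) is wanted, a word-by-word lookup might look roughly like the sketch below. It is only a sketch, not tested on the real data: the format name $isname, the data sets fmt and flagged, and the variable flag are made up for illustration, the delimiter list is a guess, and the name list is assumed to hold no duplicates. It should be run against the original xx, before the replacement step above overwrites it.

```sas
** build a lookup format from the name list;
** assumes each name appears only once in nlist;
data fmt;
  set nlist end=last;
  retain fmtname '$isname' type 'C';
  length start $ 10 label $ 1 hlo $ 1;
  start = lowcase(short);
  label = 'Y';
  hlo   = ' ';
  output;
  if last then do;
    ** catch-all row for every word that is not on the list;
    start = ' ';
    hlo   = 'O';
    label = 'N';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=fmt;
run;

** scan each comment word by word and flag it instead of replacing;
data flagged;
  set xx;
  length word $ 200;
  flag = 0;
  i = 1;
  word = scan(comment, i, ' ,.;:!?');
  do while (word ne ' ' and flag = 0);
    if put(lowcase(trim(word)), $isname.) = 'Y' then flag = 1;
    i = i + 1;
    word = scan(comment, i, ' ,.;:!?');
  end;
  drop i word;
run;
```

Each comment is scanned once and the loop stops at the first hit, so the cost depends mostly on the number of words per comment rather than growing linearly with the number of names.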

Regards,

Ya Huang

-----Original Message-----
From: VStevens [mailto:lilybear.nospam@BELLSOUTH.NET]
Sent: Thursday, January 25, 2001 3:52 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: nasty text processing puzzle: SAS or Perl?

We want to remove names and nasty words from verbatim comments on some surveys. The comment fields are $6000. We're using V8 (8.2 if I can arrange it soon)

We have a list of names from an employee file. We may also search for vulgar words etc.... I'm thinking that George Carlin's list would be useful here :-)

There have been a couple of tragic stabs at this in sql. crash and burn time. I thought it might be fun to try the following...

* get unique names (as words), so I end up with a list of single words, both surnames and first names.
* get all unique words from the comment fields.
* match these and get a list of names that actually exist in the data.

*** THEN, run some kind of search and replace drill through the verbatims word by word.

This has the benefit of not running all 70k names through all the verbatims, only the names or curses we know are actually there (800 or so).
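
The first steps above (unique words from the comments, matched against the name list) could also be done inside SAS rather than with an external tool. A rough sketch, using the small xx and nlist test data sets from the reply at the top as stand-ins (xx as first created there, before the names are blanked out); the data set names words and hits, the word length of $40, and the delimiter list are assumptions, not anything taken from the real survey data:

```sas
** one row per word per comment, lower-cased;
data words;
  set xx;
  length word $ 40;
  i = 1;
  word = lowcase(scan(comment, i, ' ,.;:!?'));
  do while (word ne ' ');
    output;
    i = i + 1;
    word = lowcase(scan(comment, i, ' ,.;:!?'));
  end;
  keep word;
run;

** names from the list that actually occur in the comments;
proc sql;
  create table hits as
  select distinct n.short
  from nlist n, words w
  where lowcase(n.short) = w.word;
quit;
```

The hits table would then play the role of the reduced checklist (the 800 or so words) for the search step.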

The processing to get the unique words was easily handled by this nifty tool I have called TextPipe. http://www.crystalsoftware.com.au/ Just dumped the verbatim fields from SAS to txt (though it has some dbms connect ability that I haven't messed with yet)

So when it comes to the *** THEN .... step... I still don't know a good way to rip through the comments in SAS. Array processing by substringing and using the input function is the first thing that comes to mind, but it sounds really messy. Loading all 800 words from the checklist into macro vars has been suggested, then generating 800 if statements and using indexc function.

I suspect that PERL might offer the answer, but I can't quite zero in on the way to do it (being about 24 hrs into PERL). Seems like an iterative grep of some sort (or shoving it through a hash table word by word, but I don't necessarily want to replace the word, maybe stick some kind of special character on the end to search for).

I guess I want the record flagged, not text replaced. I think we should have a person decide to suppress or not (but would like to flag records to review). AND... this survey will be done over around 80,000 people, and in several languages, including asian languages. So whatever we do, it has to scale. Fortunately, this processing won't be done all at once, but will trickle in. Still we don't want this to bog the thing down.

Personally I think it's way too much processing not to come out of it with some nice abstract of the comment that assists in getting MEANING out of the responses... This is so crude... not exactly knowledge management! There are some text mining tools that would do it, but this group is not necessarily ready for a warp speed transition to state of the art.

Any ideas would be appreciated!

V. Stevens

