Date: Thu, 25 Jan 2001 14:48:27 -0800
Reply-To: "Huang, Ya" <ya.huang@AGOURON.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Huang, Ya" <ya.huang@AGOURON.COM>
Subject: Re: nasty text processing puzzle: SAS or Perl?
Here is a solution; it might be slow, though:
** testing data set;
data comments;
  length comment $ 200;
  comment='xx yyy john smith zzz kkkk'; output;
  comment='www jjjj alan ppp'; output;
run;
** name list data set with the names that need to be removed;
data names;
  length short $ 10;
  input short $;
  cards;
john
smith
alan
;
** use sql to create a macro variable, which will repeatedly;
** call the tranwrd() function to replace each name with a blank;
proc sql noprint;
  select "comment=tranwrd(comment,'"||compress(short)||"','');"
    into :repl separated by ' '
  from names;
quit;

data comments;
  set comments;
  &repl;
run;

proc print;
run;
The output:
1 xx yyy zzz kkkk
2 www jjjj ppp
It is logically and syntactically simpler; the drawback is efficiency: for
each observation, 99% of the tranwrd() calls will find nothing to replace,
so time is wasted. But with only 800 names, I guess it is still doable.
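One way to sketch around the wasted calls is to have the same SQL step generate a cheap index() test in front of each tranwrd(), so names absent from a comment are skipped before the rewrite. This is only a variant of the approach above, not something from the original post; the data set names (names, comments, cleaned) are placeholders, and whether the guard actually saves time depends on how tranwrd() behaves on a miss.

```
** variant: guard each tranwrd() with index() so names not;
** present cost only a scan (data set names are placeholders);
proc sql noprint;
  select "if index(comment,'"||compress(short)||"') then "||
         "comment=tranwrd(comment,'"||compress(short)||"','');"
    into :repl separated by ' '
  from names;
quit;

data cleaned;
  set comments;
  &repl;
run;
```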
From: VStevens [mailto:lilybear.nospam@BELLSOUTH.NET]
Sent: Thursday, January 25, 2001 3:52 AM
Subject: nasty text processing puzzle: SAS or Perl?
We want to remove names and nasty words from verbatim comments on some
surveys. The comment fields are $6000. We're using V8 (8.2 if I can
arrange it soon).
We have a list of names from an employee file. We may also search for
vulgar words, etc. I'm thinking that George Carlin's list would be useful.
There have been a couple of tragic stabs at this in SQL: crash and burn.
I thought it might be fun to try the following...
* get unique names (as words), so I end up with a list of single words,
both surnames and first names.
* get all unique words from the comment fields
* match these and get a list of names that actually exist in the data.
*** THEN, run some kind of search and replace drill through the
verbatims word by word.
This has the benefit of not running all 70k names through all the verbatims,
only the names or curses we know are actually there (800 or so).
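The matching step above could be sketched as a simple SQL intersection of the two word lists. The data set and variable names (comment_words, name_words, word, hitlist) are my own placeholders, not anything from the poster's setup:

```
** sketch: keep only the name/curse words that actually;
** occur in the comments (all names here are placeholders);
proc sql;
  create table hitlist as
  select distinct upcase(w.word) as word
  from comment_words as w, name_words as n
  where upcase(w.word) = upcase(n.word);
quit;
```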
The processing to get the unique words was easily handled by this nifty tool
I have called TextPipe. http://www.crystalsoftware.com.au/
Just dumped the verbatim fields from SAS to txt (though it has some DBMS
connect ability that I haven't messed with yet).
So when it comes to the *** THEN ... step, I still don't know a good
way to rip through the comments in SAS. Array processing by substringing and
using the input function is the first thing that comes to mind, but it
sounds really messy.
Loading all 800 words from the checklist into macro vars has been suggested,
then generating 800 if statements and using the indexc function.
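One caution on that suggestion: indexc() matches any single character from a list, so for whole-word hits index() or indexw() is the better fit. And rather than 800 separate if statements, SQL can generate one OR chain of tests into a macro variable. A rough sketch, assuming the matched words sit in a data set hitlist with variable word (placeholder names; also note macro variables have a length limit, so a very long list may need to be split):

```
** generate one OR chain of word tests instead of 800 ifs;
** (hitlist, word, comments are placeholder names; watch;
**  the macro variable length limit with long lists);
proc sql noprint;
  select "indexw(upcase(comment),'"||upcase(compress(word))||"')"
    into :tests separated by ' or '
  from hitlist;
quit;

data flagged;
  set comments;
  flag = (&tests) > 0;
run;
```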
I suspect that Perl might offer the answer, but I can't quite zero in on
the way to do it (being about 24 hours into Perl). It seems like an iterative
grep of some sort (or shoving it through a hash table word by word, but I
don't necessarily want to replace the word; maybe stick some kind of special
character on the end to search for).
I guess I want the record flagged, not the text replaced. I think we should
have a person decide whether to suppress it (but I would like to flag records
for review). AND... this survey will be run over around 80,000 people, and in
several languages, including Asian languages. So whatever we do, it has to
scale. Fortunately, this processing won't be done all at once, but will
trickle in. Still, we don't want this to bog things down.
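The flag-not-replace, word-by-word idea has a reasonably close SAS analogue to a hash lookup: build a character format from the hit list with cntlin=, then scan() each word of the comment and test it with put(). This is only a sketch under assumed names (hitlist, word, comments), not the poster's actual data:

```
** build a lookup format from the hit list, then flag any;
** record whose comment contains a listed word (names assumed);
data fmt;
  set hitlist end=last;
  retain fmtname '$hit' type 'C';
  start = upcase(word);
  label = 'Y';
  output;
  if last then do;          /* everything else maps to N */
    hlo = 'O'; label = 'N'; output;
  end;
run;

proc format cntlin=fmt;
run;

data flagged;
  set comments;
  flag = 0;
  i = 1;
  word = scan(comment, i);
  do while (word ne '');
    if put(upcase(word), $hit.) = 'Y' then flag = 1;
    i + 1;
    word = scan(comment, i);
  end;
  drop i word;
run;
```

Since only a flag is set, the comment text stays untouched for the human reviewer, and the per-record cost is one pass over the words regardless of how long the hit list grows.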
Personally, I think it's way too much processing not to come out of it with
some nice abstract of the comment that assists in getting MEANING out of the
responses... This is so crude... not exactly knowledge management! There
are some text mining tools that would do it, but this group is not
necessarily up for a warp-speed transition to the state of the art.
Any ideas would be appreciated!