LISTSERV at the University of Georgia
Date:         Tue, 13 Jan 2009 13:23:29 +0100
Reply-To:     Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Subject:      AW: Frequency count of words
Comments: To: karma <dorjetarap@GOOGLEMAIL.COM>
In-Reply-To:  <d6a0d8f10901130405n1d099c32m2423bf064976266c@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi karma,

If there is a lot of input data, lstring will bump up against the variable length limit (max 32767 I believe) and stop counting words.
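
One way around the limit, as a rough sketch rather than a tested replacement: keep a running count for each word in the hash itself, so no concatenated string is needed. This assumes the lyrics data set from the quoted post; the output name word_counts comes from the original question, and folding words to lower case and the $20 word length are my assumptions.

```sas
/* Sketch: count words as they are read, without building lstring. */
/* Assumes the LYRICS data set from the quoted post exists.        */
data _null_;
  length word $20 count 8;            /* $20 is an assumed maximum word length */
  declare hash hh(ordered: 'a');      /* keep words in ascending order         */
  hh.definekey('word');
  hh.definedata('word', 'count');
  hh.definedone();

  do until (eof);
    set lyrics end=eof;
    nwords = countw(dline, ', ');     /* same delimiters as SCAN below */
    do i = 1 to nwords;
      word = lowcase(scan(dline, i, ', '));    /* assumption: ignore case      */
      if hh.find() = 0 then count = count + 1; /* word seen before: increment  */
      else count = 1;                          /* first occurrence             */
      hh.replace();                            /* store the updated count      */
    end;
  end;

  hh.output(dataset: 'word_counts');  /* one row per distinct word */
  stop;
run;

proc print data=word_counts; run;
```

Because each word is looked up as a whole key, short words such as "I" are not matched inside longer words, and the amount of input is limited by memory for the hash rather than by the length of a single character variable.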

Allen

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of karma
Sent: Tuesday, 13 January 2009 13:06
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Frequency count of words

Here's a hashing solution that splits each line into words and keeps each unique word in a hash table. All of the dline strings are also concatenated into one long string. Finally, each word in the hash is counted by searching the long string for it.

HTH

data lyrics;
  infile datalines dsd dlm = "|" missover firstobs = 1;
  input dline :$20000.;
datalines;
There I was completely wasting, out of work and down
All inside its so frustrating as I drift from town to town
Feel as though nobody cares if I live or die
So I might as well begin to put some action in my life
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
;
run;

data wordcount(keep=word count);
  length word $20 lstring $3000;

  /* hash table holding each unique word */
  declare hash hh();
  hh.definekey('word');
  hh.definedata('word');
  hh.definedone();

  do until(eof);
    set lyrics end=eof;
    nwords=countw(dline);
    /* add every word on this line to the hash */
    do _n_=1 to nwords;
      word = scan(dline,_n_,', ');
      hh.replace();
    end;
    /* append the line to the long string */
    lstring = catx(' ',lstring,dline);
  end;

  /* search the long string for each word in the hash (case-insensitive) */
  declare hiter iter("hh");
  rc = iter.first();
  do while (rc=0);
    count = count(lstring, strip(word),'i');
    output;
    rc = iter.next();
  end;
run;

proc print;run;

2009/1/13 Anindya Mozumdar <anindya.lugbang@gmail.com>:
> All,
> Supposing I have a dataset which is created this way -
>
> data lyrics;
> infile datalines dsd dlm = "|" missover firstobs = 1;
> input dline :$20000.;
> datalines;
> There I was completely wasting, out of work and down
> All inside its so frustrating as I drift from town to town
> Feel as though nobody cares if I live or die
> So I might as well begin to put some action in my life
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> ;
> run;
>
> What I want is a dataset called word_counts, containing two variables
> word and count which will be the number of times each word occurs in
> any line in the above dataset. For example, given the dataset lyrics,
> word_counts should contain
>
> word count
> completely 1
> breaking 8
> frustrating 1
> ....
>
> Can any of you suggest a solution for this problem? Thanks in advance.
>
> Regards,
> Anindya
>

