Date: Tue, 13 Jan 2009 13:23:29 +0100
Reply-To: Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Subject: AW: Frequency count of words
In-Reply-To: <d6a0d8f10901130405n1d099c32m2423bf064976266c@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi karma,
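If there is a lot of input data, lstring will run into the character
variable length limit (the maximum is 32767 bytes, I believe, and the code
below declares only $3000). Once the concatenated text no longer fits,
the word counts will be wrong.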
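
One way around that is to skip the long string and keep a running count in
the hash itself. The following is only an untested sketch, not a drop-in
replacement: it assumes the lyrics dataset from the original post, borrows
the $20 word length from the code below, and writes to word_counts as the
original question asked. The loop variable i is my own addition.

```sas
data _null_;
   length word $20 count 8;

   /* key = word, data = word plus its running count; ordered for readable output */
   declare hash hh(ordered: 'a');
   hh.definekey('word');
   hh.definedata('word', 'count');
   hh.definedone();

   do until (eof);
      set lyrics end=eof;
      do i = 1 to countw(dline, ', ');
         /* lowercase so "Breaking" and "breaking" are the same key */
         word = lowcase(scan(dline, i, ', '));
         /* find() loads the stored count if the word is already in the hash */
         if hh.find() ne 0 then count = 0;
         count = count + 1;
         hh.replace();
      end;
   end;

   /* write one row per distinct word */
   hh.output(dataset: 'word_counts');
   stop;
run;
```

Since each word is counted as it is scanned, nothing has to fit into a
single character variable, and only whole words are counted. For the sample
lyrics, breaking should come out as 8, as in the original question.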
Allen
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of karma
Sent: Tuesday, 13 January 2009 13:06
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Frequency count of words
Here's a hashing solution that breaks each line up into words and keeps
each unique word in a hash table. All of the dline strings are also
concatenated into one long string. Finally, each word in the hash is
used to search the long string.
HTH
data lyrics;
   infile datalines dsd dlm = "|" missover firstobs = 1;
   input dline :$20000.;
datalines;
There I was completely wasting, out of work and down
All inside its so frustrating as I drift from town to town
Feel as though nobody cares if I live or die
So I might as well begin to put some action in my life
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
;
run;
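data wordcount(keep=word count);
   length word $20 lstring $3000;
   /* hash of unique words; the key doubles as the data item */
   declare hash hh();
   hh.definekey('word');
   hh.definedata('word');
   hh.definedone();
   do until(eof);
      set lyrics end=eof;
      /* use the same delimiters for countw and scan */
      nwords = countw(dline, ', ');
      do _n_ = 1 to nwords;
         /* lowercase so "Breaking" and "breaking" share one key */
         word = lowcase(scan(dline, _n_, ', '));
         hh.replace();
      end;
      /* one long string of all lines, limited by the length of lstring */
      lstring = catx(' ', lstring, dline);
   end;
   declare hiter iter("hh");
   rc = iter.first();
   do while (rc = 0);
      /* case-insensitive substring count, so "as" also matches inside "was" */
      count = count(lstring, strip(word), 'i');
      output;
      rc = iter.next();
   end;
   stop;
run;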
proc print;run;
2009/1/13 Anindya Mozumdar <anindya.lugbang@gmail.com>:
> All,
> Supposing I have a dataset which is created this way -
>
> data lyrics;
> infile datalines dsd dlm = "|" missover firstobs = 1;
> input dline :$20000.;
> datalines;
> There I was completely wasting, out of work and down
> All inside its so frustrating as I drift from town to town
> Feel as though nobody cares if I live or die
> So I might as well begin to put some action in my life
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> Breaking the law, breaking the law
> ;
> run;
>
> What I want is a dataset called word_counts with two variables, word
> and count, where count is the number of times each word occurs across
> all lines of the above dataset. For example, given the dataset lyrics,
> word_counts should contain
>
> word          count
> completely    1
> breaking      8
> frustrating   1
> ....
>
> Can any of you suggest a solution for this problem? Thanks in advance.
>
> Regards,
> Anindya
>