Date: Fri, 22 Oct 1999 16:18:50 +0000
Reply-To: kmself@ix.netcom.com
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Karsten M. Self" <kmself@IX.NETCOM.COM>
Organization: Self Analysis
Subject: Re: Stripping out HTML tags
Content-Type: text/plain; charset=us-ascii
> Date: Thu, 21 Oct 1999 12:04:40 -0400
> From: "Steven E. Stevens" <sstevens@LATERALTHOUGHT.COM>
> Subject: Stripping out HTML tags
>
> Would anyone be willing to share (or point me to) a chunk of SAS code
> (datastep or macro) to strip out some or all HTML tags from character
> strings? Am running SAS V7/8, so 200 byte character variable limitation is
> not an issue. Thanks in advance for any responses...
In yet another non-SAS response:
The easiest solution I could think of would be to use an existing
browser to dump rendered text. Lynx, a text-based web browser, can do
just this from the command line. Under Unix it could be used via a
FILENAME PIPE as SAS input; on other platforms you would generally
dump the output to a file and read that file with SAS.
    lynx -dump <filename or URL>
http://lynx.browser.org/
From the site:
Lynx is a text browser for the World Wide Web. Lynx 2.8.2
runs on Un*x, VMS, Windows 95/98/NT but not 3.1 or 3.11,
on DOS (386 or higher) and OS/2 EMX. The current
developmental version is also available for testing. Ports
to Mac are in beta test.
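For the dump-to-a-file route, here is a minimal sketch of a wrapper script, written in Python purely for illustration. It assumes lynx is installed and on the PATH; the function names and the file names in the usage note are hypothetical, and the only lynx option used is the -dump shown above.

```python
import shutil
import subprocess


def lynx_dump_command(source):
    """Build the argument list for: lynx -dump <filename or URL>."""
    return ["lynx", "-dump", source]


def dump_to_file(source, out_path):
    """Run lynx and save its rendered text to out_path for SAS to read.

    Assumes lynx is on PATH; raises RuntimeError if it is not found.
    """
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx was not found on PATH")
    result = subprocess.run(
        lynx_dump_command(source),
        capture_output=True,  # collect the rendered text from stdout
        text=True,
        check=True,           # raise if lynx exits with an error
    )
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return out_path
```

Calling dump_to_file("page.html", "page.txt") would leave plain text in page.txt, which SAS could then read with an ordinary INFILE statement.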
Perl also has several HTML-to-text modules. See O'Reilly's "The Perl
Cookbook" (Christiansen & Torkington), pp. 714 ff., for more information.
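To give a rough idea of what such a module does, here is a small sketch using the HTML parser from Python's standard library rather than Perl. This is not the Cookbook's code; TagStripper and strip_tags are hypothetical names, and skipping the contents of script and style elements is my own assumption about what "stripping tags" should mean.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content and discard the tags (illustrative helper)."""

    SKIP = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; into & etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, strip_tags("<p>Hello <b>SAS-L</b> &amp; friends</p>") returns "Hello SAS-L & friends". Unlike lynx -dump, this does no layout at all: tables, lists and line breaks simply disappear along with their tags.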
For those who like to browse text-only but don't like Lynx's management
of tables or frames, may I suggest w3m:
http://freshmeat.net/appindex/1999/06/09/928951047.html
http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/
--
Karsten M. Self (kmself@ix.netcom.com)
What part of "Gestalt" don't you understand?
SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list: "subscribe sas-linux" to
mailto:majordomo@cranfield.ac.uk
9:10am up 21:28, 2 users, load average: 0.25, 0.17, 0.09