Date: Mon, 4 Dec 2006 19:12:17 -0500
Reply-To: John Birken <zbq5@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: John Birken <zbq5@CDC.GOV>
Subject: Re: How to Read PDF/RTF data in to SAS datasets
Content-Type: text/plain; charset=ISO-8859-1
Dwi:
You're facing a tricky problem. I have to cope with 50,000+ line PDF text
conversions many times a year. It can be automated, as you ask, but I
recommend you learn to do it manually first. If you don't work with many
new data sets per day it's probably not worth automating, unless you want
some programming fun. The next SAS version, now in beta, is supposed to be
able to handle PDFs, but few people have it.
For us mortals (not Ron Fehd):
I import it into Excel, setting the columns where I want them. If your data
has many page titles and lots of formatting, as mine does, you might have to
clean them up in Excel; for long data tables you can instead put some
conditional statements in SAS or Excel to ignore the garbage. It all depends
on how much garbage you have.
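To illustrate the conditional-statement route in SAS, here is a minimal
sketch, not tested against your file. The file path, the data set name, the
variable names (id, value1, value2) and the rule that real data lines start
with a digit are all assumptions for illustration; adjust the test to
whatever your garbage actually looks like.

```sas
/* Sketch only: path, names and the "starts with a digit" rule are assumptions */
data work.pdf_data;
    infile 'C:\temp\report.txt' truncover;
    input @;                       /* load the line into the buffer and hold it */
    /* skip page titles, headers and blank lines:
       anything whose first non-blank character is not a digit */
    if not ('0' <= substr(left(_infile_), 1, 1) <= '9') then delete;
    input id value1 value2;        /* list input, blank delimited */
run;
```

Note that this rule would also drop data lines that begin with a minus sign
or a decimal point, and it would keep a title that happens to begin with a
digit, so check the log and the first few rows before trusting it.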
Here are the steps:
* Save the PDF as plain text.
* From Excel, open the txt file.
* In the Text Import Wizard choose Delimited, then select Space ONLY as the
  delimiter.
* Numbers between the spaces appear in their own columns.
* Excel seems to know to ignore the title garbage when setting the columns.
* You can import this into SAS or read it as a column or flat file.
If your data is clean, with no page titles and not many fancies, just save
the PDF as plain text and read it into SAS without Excel.
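For the clean case a plain DATA step may be all you need. This is a minimal
sketch, assuming a blank-delimited file with three numeric columns; the path
and the variable names are made up for illustration.

```sas
/* Sketch only: path and variable names are hypothetical */
data work.pdf_data;
    infile 'C:\temp\report.txt' truncover;
    input id value1 value2;        /* list input splits on blanks */
run;

/* eyeball the first rows before going further */
proc print data=work.pdf_data(obs=10);
run;
```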
If you find this time consuming and want to do other things, BUY one of
the many shareware and commercial software packages that do this. They run
from $29 to $89; one or two wanted far more than $100, and they don't have
a clue.
I recently tried a $29 one that advertised that it does not expire (the
expensive ones expire in 5 days). It worked fine converting PDF to
xls/doc/txt and more, but only for 3 pages; for more pages you have to
purchase it. It's a lot better than the expensive ones that stamp their own
insignia all over the output and expire in 5 days. I would have to see your
data to be more specific.
If you need additional details, post the question or e-mail me directly.
HTH,
John, jbirken@cdc.gov