Date: Fri, 12 Mar 1999 10:52:41 +1100
Reply-To: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Subject: SAS, BASS, and Linux -Reply
Content-Type: text/plain; charset=US-ASCII
(Karsten's post is reproduced below)
Karsten,
I agree almost entirely with your summary of the most useful bits of SAS (we can argue over details later..), and I agree that it would be better to clone the functionality of SAS rather than the exact syntax and features. I also agree that it would be vital to provide a bridge from existing SAS code (and expertise) to any free SAS alternative. A bridge from other products such as SPSS would also be possible, which means that the PSPP (free SPSS clone) project might join forces with an alternative-to-SAS project. And yes, there is a hell of a lot of good statistical code and procedures out there which could be incorporated into (or interfaced to) a SAS alternative, and yes, the monolithic approach is not the way to go. I think that this is the lesson to be learnt from Linux, which started out as just a lot of free Unix components assembled around a new (free) Unix-like kernel - the Linux project didn't slavishly clone an existing Unix but it didn't try to totally re-invent the wheel either, so it could build on and incorporate what went before. That's one of the reasons why Linux is growing so rapidly while BeOS and other totally new operating systems are not. A free alternative to SAS could follow the same path.
And yes, it is a shame that the BASS system was written in Pascal and assembler, but having sources available would still be a help. Hmmm, 10 person-years of effort shared between 120 people is only 1 month, or more realistically, about 2 years' worth of rainy Sunday afternoons...
I don't think that SI need worry about such an initiative - just as people are not ripping out their IBM mainframes and replacing them with Linux boxes, I don't think that any of SI's big clients would replace SAS with a free alternative, at least not for quite a while!
Tim Churches
>>> Karsten Self <kmself@IX.NETCOM.COM> 12/March/1999 07:54am >>>
I just spoke with Jeff Bass of Amgen, creator of the BASS system (a
1980s implementation of the SAS language and several procedures on
PCs), about any possibilities this might open up for a free SAS on
Linux. There's mixed news.
First, Jeff would be interested in seeing SAS-like capabilities on
Linux. He would be willing to help by way of providing the BASS
sources, and providing some guidance in their interpretation. He would
not be interested in doing the development directly.
That said, there are some aspects of BASS which both help and hinder:
- SAS is based on publicly available foundations -- the original
NCSU project was an FDA-funded research project, and SAS releases
through about SAS 74 or 76 are available with sources, AFAIK
(though this would be PL/I and MVS assembler).
- BASS implemented the DATA step and about 20 commonly used
procedures.
- Many of the algorithms used in BASS are based on documentation of
early versions of the SAS system, or other published algorithms.
It should be possible to reimplement these or newer, improveder
versions.
- Due to PC limitations of the time, BASS was coded in Microsoft
Pascal, and assembler, about 80% and 20% respectively. BASS is
probably less portable than SAS itself. I don't know what language
support there is for cross-compiling or porting Pascal or MS Pascal
to gcc or related. The resulting code would probably be
unmaintainable, even if it ran. However, GNU does provide a number
of porting tools. I have no experience in this area.
- BASS was a code-compatible, but not a data-compatible system.
Transport format was ASCII files. These were sneakernet days,
and the possibility of widescale data distribution was not
anticipated.
- The sources are available from Jeff. The algorithms used are
frequently documented in the source. Some work may be required
to pull the sources from archival media.
- The DATA step and basic I/O were a fairly elementary coding
effort. The full BASS system represented about 4 man-years of
development. Jeff anticipates a similar project today would
require 10 man-years.
My own comments follow.
What I find most useful about SAS are:
- A simple but powerful procedural data language with a decent
function library, raw data I/O abilities (format/informat),
and convenient methods for working with sorted data (FIRST.,
LAST.), and other miscellaneous features: SET/UPDATE/MERGE/
MODIFY, FILE and INFILE options, SET options, etc.
- Process accounting -- resource utilization, record, and variable
reporting following process steps.
- Persistent data attribute associations: name, type, length,
format/informat, label, and metadata about these attributes
(DICTIONARY tables).
- A set of integrated PROCs which provide trivial access to basic
data manipulation and reporting functions. I could accomplish
virtually all my work with SQL, DATA STEP, FREQ, MEANS/SUMMARY,
PRINT, UNIVARIATE, FORMAT, COMPARE, SORT, and CONTENTS. Of this
list, DATA STEP, FREQ, MEANS/SUMMARY, SORT, and CONTENTS largely
roll easily into a sufficiently featured SQL. PRINT can be
accomplished in a DATA _NULL_. This leaves DATA, SQL, and
a statistics library.
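As a rough illustration of how FREQ, MEANS/SUMMARY, SORT, and CONTENTS
might collapse into plain SQL, here is a minimal sketch in Python using
the standard sqlite3 module. The table name, column names, and data are
made up for the example, and the mapping to each PROC is only
approximate.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (clinic TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("north", 30), ("south", 40), ("north", 50), ("south", 20), ("north", 40)],
)

# Roughly PROC FREQ: counts per level of a classification variable.
freq = conn.execute(
    "SELECT clinic, COUNT(*) FROM visits GROUP BY clinic ORDER BY clinic"
).fetchall()

# Roughly PROC MEANS with a CLASS statement: N, mean, min, max per group.
means = conn.execute(
    "SELECT clinic, COUNT(age), AVG(age), MIN(age), MAX(age) "
    "FROM visits GROUP BY clinic ORDER BY clinic"
).fetchall()

# Roughly PROC SORT followed by PROC PRINT: rows in key order.
ordered = conn.execute(
    "SELECT clinic, age FROM visits ORDER BY clinic, age"
).fetchall()

# Roughly PROC CONTENTS: column names and declared types from the catalog.
contents = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(visits)")
]
```

Anything beyond simple counts and moments (UNIVARIATE's quantiles, for
instance) would still need the statistics library mentioned above.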
...I realize other users' needs differ. Additional features include
graphics and statistical procedures, database connectivity, remote
connectivity, OS hooks, code generation (CALL EXECUTE, MACRO), data
browsing (the _only_ reason to use interactive SAS). The remaining
features of SAS provide less than 1% of my needs.
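Of the DATA step conveniences listed earlier, the FIRST./LAST. flags
for sorted data are probably the least obvious to reproduce elsewhere.
A minimal Python sketch of the idea follows; the helper name
by_group_flags and the sample rows are hypothetical, and it assumes the
input is already sorted by the BY key, as SAS itself would require.

```python
from itertools import groupby
from operator import itemgetter


def by_group_flags(rows, key):
    """Yield (row, first, last) for rows already sorted by `key`.

    `first` and `last` play the role of SAS's FIRST.key and LAST.key.
    """
    for _, group in groupby(rows, key=itemgetter(key)):
        members = list(group)
        for i, row in enumerate(members):
            yield row, i == 0, i == len(members) - 1


# Hypothetical data, sorted by "id" as a BY statement would require.
rows = [
    {"id": "a", "amount": 10},
    {"id": "a", "amount": 5},
    {"id": "b", "amount": 7},
]

# Running total per id, output on the last row of each group.
totals = []
for row, first, last in by_group_flags(rows, "id"):
    if first:
        total = 0
    total += row["amount"]
    if last:
        totals.append((row["id"], total))
```

This buffers each group in memory to know which row is last; a
streaming version would look one row ahead instead.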
What I'm disenchanted with are:
- Macro. As much as I use and appreciate it, it is a kludge. It is
a preprocessor, not a true programming language. Debug support
is horrible. This is addressed to an extent by SCL. I'd much
prefer seeing a real control language, along the lines of Perl.
- Disconnect with other development tools. It is relatively
difficult to wed SAS with other programming tools or
environments. The fact that SAS is monolithic does not help
much in this regard. Using SAS as a server is somewhat better,
but it certainly doesn't fall into the Unix shell tools model.
It doesn't have to, but many very powerful tools do.
The SAS NIH syndrome has led to a monolithic tool incorporating
a data language, a macro/scripting language, an SQL implementation,
a statistical library, an application development environment, a
graphics generation facility, an integrated development environment,
a data browsing/editing environment, ... _none_ of which are
of any use outside of SAS, and all of which require an annual
investment in SAS products in order to be used. My own use of SAS
tools (above) is geared largely toward what is required to get
work done in SAS, and what translates more broadly into other
areas of programming application. Hence, data step, SQL,
fundamental utilities. I've rather pointedly neglected learning
tools such as TABULATE and REPORT due to their limited and
idiosyncratic aspects.
- Lack of user definable functions / procedures (addressed to an
extent by SAS/Toolkit).
- Lack of long variable names (added in v7).
- Lack of access to higher programming features: finer grained
use of arrays, more data types (boolean, integer, long character),
better (or more standardized) regular expression tools.
- Standalone/runtime capability.
- Integration with third-party tools.
- Current level of Linux support.
I'll say again that I'm not particularly interested in building yet
another SAS; I'd rather work with existing tools available for
Unix/Linux.
Still, one approach which might be worth exploration is to come up with
a language translation utility which would translate SAS code into an
equivalent, say, Perl. The addition of a module to provide the type of
accounting and persistent data attributes available with SAS would be a
plus. Procedures could be mapped to close equivalents in existing
statistical languages. A colleague suggests that many of the SAS
statistical procedures are validated in IML; it might be possible to use
an existing matrix language as the basis for rapid development of a
statistical procedure library. Not being particularly versed in matrix
languages or advanced statistics, I can't comment on viability, but it
sounds interesting.
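To make the "accounting and persistent data attributes" module a
little more concrete, here is a minimal sketch, written in Python
rather than Perl purely for illustration. Every name in it (Variable,
Dataset, run_step) is hypothetical, and the log line only imitates the
spirit of the notes SAS writes after each step, not their exact format.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Persistent attributes of one column (hypothetical structure)."""
    name: str
    type: str          # "num" or "char"
    length: int
    format: str = ""
    label: str = ""


@dataclass
class Dataset:
    """Rows plus the attribute metadata that travels with them."""
    name: str
    variables: list
    rows: list = field(default_factory=list)


def run_step(source, target_name, keep):
    """Copy rows passing `keep` to a new dataset and report counts."""
    target = Dataset(target_name, list(source.variables))
    for row in source.rows:
        if keep(row):
            target.rows.append(dict(row))
    log = (
        f"NOTE: {len(source.rows)} observations read from {source.name}; "
        f"{len(target.rows)} observations and {len(target.variables)} "
        f"variables written to {target.name}."
    )
    return target, log


# Hypothetical usage: subset a small dataset and keep its metadata.
patients = Dataset(
    "patients",
    [Variable("id", "num", 8, label="Patient ID"),
     Variable("sex", "char", 1, format="$SEX.")],
    [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}, {"id": 3, "sex": "F"}],
)
females, log = run_step(patients, "females", lambda r: r["sex"] == "F")
```

A translator would then only need to emit calls against a layer like
this, rather than re-creating the bookkeeping in every generated
program.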
What would really help a project like this along would be an identified
sponsor or sponsors. Again, I'm playing the role of data conduit here,
not advocate. I'd be interested but not obsessed with such a project.
--
Karsten M. Self (kmself@ix.netcom.com)
What part of "gestalt" don't you understand?