Date: Wed, 23 Feb 2000 23:43:33 GMT
Reply-To: sashole@mediaone.net
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject: Re: creating 4 variables from a string
Content-Type: text/plain; format=flowed
Arthur,
The situation is not all that bad since at least there is some stable
pattern, so an attempt can be made to crack it using heuristics. First, it
follows from your input that in each record, the first numeric token (123,
23, 99...) is the beginning of the street address (I assume that is what you
need for your STREET variable), so CTAKER is everything preceding it. If
nothing precedes the token, CTAKER will be blank. The next breakpoint comes
from generally recognizable notations, possibly abbreviated, for a street,
lane, boulevard, route, plaza, and so on. In the code below, I used only
those present in your excerpt, i.e. 'st','ave','lane','pkwy','road','rd';
however, I see no problem in principle with expanding the list (I would be very
surprised if the general full list exceeded a couple of hundred items).
Third, the state abbreviations provide the final breakpoint. Again, you will
have to expand the list to include all states. Such things as apartments,
etc. can be dealt with case by case. Assuming that your dataset is called IN
and has one 126-byte variable REC,
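
data parsed (keep=ctaker street city state);
   set in;
   /* a DO list takes the variable length from its first item, so fix it here */
   length abbr $ 4;

   /* caretaker: everything before the first digit */
   cptr = indexc (rec,'0123456789');
   if cptr > 1 then ctaker = substr (rec,1,cptr-2);
   else ctaker = ' ';

   /* find the street-type word, then step to the blank that follows it */
   do abbr = 'st ','ave ','lane','pkwy','road','rd';
      sptr = indexw (rec,upcase(abbr));
      if sptr then leave;
   end;
   do while (substr(rec,sptr,1) > ' ');
      sptr ++ 1;
   end;
   str = substr (rec,cptr,sptr-cptr);

   /* the state abbreviation is the last breakpoint */
   do state = 'ma','ct','nj','ny','pa';
      zptr = indexw (rec,upcase(state));
      if zptr then leave;
   end;
   ct = left(substr (rec,sptr,zptr-sptr));

   /* "APT" plus the unit belongs to the street, not to the city */
   if index (ct, upcase('apt')) then do;
      city = scan (ct,3);
      street = trim(str)||' '||trim(scan(ct,1))||' '||scan(ct,2);
   end;
   else do;
      street = str;
      city = ct ;
   end;
run;
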
This produces the following parsed output:

OBS  CTAKER                                      STATE
  1                                              ma
  2                                              ma
  3                                              ct
  4                                              nj
  5                                              ny
  6                                              ny
  7                                              ny
  8                                              ny
  9                                              ny
 10                                              ny
 11                                              ny
 12  MARILYN MASON CENTER FOR ROBBY CARROWAY     pa
 13  GDN OFF NORTH CAPE CENTER FOR BILLY SMYTHE  pa

OBS  CITY              STREET
  1  SPRINGFIELD       123 WORTHINGTON ST APT 506
  2  GREAT BARRINGTON  23 SPRING ST
  3  BRIDGEPORT        99 CLIFTON AVE
  4  CAPE MAY          1890 H WISCONSON AVE
  5  STATEN ISLAND     39 PARKVIEW LANE
  6  BRONX             2103 ELMSTEAD AVE APT 9X
  7  NEW CITY          27 LEXINGTON ST
  8  SPRING VALLEY     353 S PASCACK AVE
  9  BROOKLYN          9537 28 AVE
 10  BROOKLYN          7826 OCEAN PKWY
 11  ISLANDIA          23 SAGEBRUSH ROAD
 12  ERIE              234 WEST AVE
 13  SOUTH MOUNTAIN    11258 N MOUNTAIN RD

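
To try the step on the sample records, a test copy of IN can be built from
in-stream data. This is only a sketch: the dataset name IN and the variable
REC are the assumptions already stated, and the records are the thirteen
lines from the original question. The data lines start in column 1 because
the $CHAR informat keeps leading blanks.

data in;
   infile datalines truncover;
   input rec $char126.;
datalines;
123 WORTHINGTON ST APT 506 SPRINGFIELD MA
23 SPRING ST GREAT BARRINGTON MA
99 CLIFTON AVE BRIDGEPORT CT
1890 H WISCONSON AVE CAPE MAY NJ
39 PARKVIEW LANE STATEN ISLAND NY
2103 ELMSTEAD AVE APT 9X BRONX NY
27 LEXINGTON ST NEW CITY NY
353 S PASCACK AVE SPRING VALLEY NY
9537 28 AVE BROOKLYN NY
7826 OCEAN PKWY BROOKLYN NY
23 SAGEBRUSH ROAD ISLANDIA NY
MARILYN MASON CENTER FOR ROBBY CARROWAY 234 WEST AVE ERIE PA
GDN OFF NORTH CAPE CENTER FOR BILLY SMYTHE 11258 N MOUNTAIN RD SOUTH MOUNTAIN PA
;
run;

Run it before the DATA PARSED step; a PROC PRINT of PARSED afterwards lists
the result.
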
Please note that the little program above has no pretense of being The
General Theory of Everything. Instead, it is intended to demonstrate that a
sufficient dose of heuristics can make a fuzzy problem almost manageable.
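
As for expanding the lists, one possible arrangement is to hold the words in
a temporary array instead of the DO statement, and to flag the records in
which none of them occurs. What follows is a rough sketch only: the array
name SFX, the index I, and the DATA _NULL_ wrapper are made up for
illustration, and the six words are just the ones from the excerpt.

data _null_;
   set in;
   /* hypothetical lookup list; raise the dimension as words are added */
   array sfx [6] $ 4 _temporary_ ('ST' 'AVE' 'LANE' 'PKWY' 'ROAD' 'RD');
   sptr = 0;
   do i = 1 to dim (sfx) while (sptr = 0);
      sptr = indexw (rec, trim (sfx[i]));
   end;
   /* sptr = 0 here means no listed word occurs in the record */
   if sptr = 0 then put 'No street type found: ' rec;
run;

Records written to the log this way are the ones that need a new list entry
or separate handling.
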
Kind regards,
======================
Paul M. Dorfman
Jacksonville, FL
======================
>From: "Arthur D. Livingston" <ALiving103@AOL.COM>
>I have the following data in a variable named ADDRESS. The length is 126.
>
>123 WORTHINGTON ST APT 506 SPRINGFIELD MA
>23 SPRING ST GREAT BARRINGTON MA
>99 CLIFTON AVE BRIDGEPORT CT
>1890 H WISCONSON AVE CAPE MAY NJ
>39 PARKVIEW LANE STATEN ISLAND NY
>2103 ELMSTEAD AVE APT 9X BRONX NY
>27 LEXINGTON ST NEW CITY NY
>353 S PASCACK AVE SPRING VALLEY NY
>9537 28 AVE BROOKLYN NY
>7826 OCEAN PKWY BROOKLYN NY
>23 SAGEBRUSH ROAD ISLANDIA NY
>MARILYN MASON CENTER FOR ROBBY CARROWAY 234 WEST AVE ERIE PA
>GDN OFF NORTH CAPE CENTER FOR BILLY SMYTHE 11258 N MOUNTAIN RD SOUTH MOUNTAIN PA
>
>I would like to create 4 variables as follows:
>
>CARETAKER STREET CITY STATE
>
>I have tried using everything from substrings, index, and scan but nothing
>is working.
>Thanks for any help.