Date: Fri, 24 Sep 2010 18:41:43 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Regular expression question on fixing csv file
In-Reply-To: <AANLkTin=fq5AgXAF5i8QgM4N4DqUjTysswDExioWCJYR@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"
Ming:
How does your client generate the csv file?
I am working on a little article and presentation on using MS Excel workbooks as SAS data sources. Excel doesn't control for many types of errors. (Send me any examples you may have found of Excel shortcomings from the viewpoint of a SAS programmer, if you will, and I'll include them.) One feature that Excel 2007 includes, though, guards against mixing up commas used as separators and those used as text. For example, a csv file exported from Excel has this correct syntax:
",",,",",
,",",,","
While you may have patterns in data that you have seen, using the same character as text and metacharacter will eventually BYOYA. Logically you should have nine commas between what appear to be ID numbers and perhaps three commas between the ID and another number. But how do you know that each row will have the same sequence of data types as other rows?
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Ming Chen
Sent: Friday, September 24, 2010 4:01 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Regular expression question on fixing csv file
Hi All,
I am trying to import a client csv format data file with comma within some
columns. Since the client won't fix their crap and the boss thinks we can
fix it by at least manually removing the comma in the txt field.
By checking the data, I am sure that regular expression can come to rescue
me since there are some obvious patterns in the data. However, I am just
know a little bit the regular expression stuff and I really need some help
here.
Here are some sample data:
columns: code1, code1_desc, code2, code2_desc, code3,code3_desc,
code4,code4_desc,code5,code5_desc
111111,VANITY TOP OVAL , 31X22",4456,MARBLE FACTORY,
The,30,Plumbing,3002,Finish Plumbing,060,Bathroom Furniture
22222,MIRROR 251/2X21/4X33, NLCM-2433,9250,FOREMOST ,GROUPS
INC,30,Plumbing,3002,Finish Plumbing,060,Bathroom Furniture
3333,ACID ,BRUSH 1/2" 3/PK,1120,OATEY CANADA SUPPLY CHAIN
SERV.CO.,30,Plumbing,3004,Rough
Plumbing
444,FLARED MALE AD,. 1/4X1/4 48-4B,4056,FAIRVIEW FITTINGS & MFG,
LTD,30,,3004,Rough Plumbing
I am sure that I can fix the csv file just by removing the comma in the 5
code_desc columns and there are only number with comma before and after the
desc part.
Thanks
Ming
|