Date: Sat, 8 Jan 2011 10:52:07 -0500
Reply-To: "Kirby, Ted" <ted.kirby@LEWIN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Kirby, Ted" <ted.kirby@LEWIN.COM>
Subject: Re: Maximum Size for Character Variables?
In-Reply-To: A<AANLkTiny6tOvP+4HQzfDmbasQBHL_pjmKMRy_PqaP311@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"
Joe,
When I run your code as given, the problem is compounded. When sing the
FILEREF "afile" PROC EXPORT defines the LRECL as 256, which of course
gives the usual line truncation note. Under this, the variable X
(called "VAR3" by PROC EXPORT) has a length of 500 despite the fact that
this length is larger than the LRECL:
239
240 proc import file=afile dbms=csv out=readin_import replace;
240!
guessingrows=1050;
241 getnames=no;
242 run;
243
/**********************************************************************
244 * PRODUCT: SAS
245 * VERSION: 9.1
246 * CREATOR: External File Interface
247 * DATE: 08JAN11
248 * DESC: Generated SAS Datastep Code
249 * TEMPLATE SOURCE: (None Specified.)
250
***********************************************************************/
251 data READIN_IMPORT ;
252 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
253 infile AFILE delimiter = ',' MISSOVER DSD ;
254 informat VAR1 $4. ;
255 informat VAR2 $5. ;
256 informat VAR3 $500. ;
257 informat VAR4 $7. ;
258 informat VAR5 $3. ;
259 format VAR1 $4. ;
260 format VAR2 $5. ;
261 format VAR3 $500. ;
262 format VAR4 $7. ;
263 format VAR5 $3. ;
264 input
265 VAR1 $
266 VAR2 $
267 VAR3 $
268 VAR4 $
269 VAR5 $
270 ;
271 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR
detection macro variable */
272 run;
NOTE: The infile AFILE is:
File Name=C:\Temp\_TD4228\#LN00009,
RECFM=V,LRECL=256
NOTE: 1200 records were read from the infile AFILE.
The minimum record length was 31.
The maximum record length was 256.
One or more lines were truncated.
NOTE: The data set WORK.READIN_IMPORT has 1200 observations and 5
variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
1200 rows created in READIN_IMPORT from
AFILE.
NOTE: .READIN_IMPORT was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 1.71 seconds
cpu time 1.70 seconds
Now, when I use the actual file name of the "afile" FILEREF (copied from
the log when "afile" was created). The LRECL now is 32767, but X (VAR3)
still has a length of 500:
273 PROC IMPORT OUT= WORK.TESTREADIN
274 DATAFILE= "C:\Temp\_TD4228\#LN00009"
275 DBMS=CSV REPLACE;
276 guessingrows=1050;
277 GETNAMES=NO;
278 DATAROW=1;
279 RUN;
280
/**********************************************************************
281 * PRODUCT: SAS
282 * VERSION: 9.1
283 * CREATOR: External File Interface
284 * DATE: 08JAN11
285 * DESC: Generated SAS Datastep Code
286 * TEMPLATE SOURCE: (None Specified.)
287
***********************************************************************/
288 data WORK.TESTREADIN ;
289 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
290 infile 'C:\Temp\_TD4228\#LN00009' delimiter = ',' MISSOVER DSD
lrecl=32767 ;
291 informat VAR1 $4. ;
292 informat VAR2 $5. ;
293 informat VAR3 $500. ;
294 informat VAR4 $7. ;
295 informat VAR5 $3. ;
296 format VAR1 $4. ;
297 format VAR2 $5. ;
298 format VAR3 $500. ;
299 format VAR4 $7. ;
300 format VAR5 $3. ;
301 input
302 VAR1 $
303 VAR2 $
304 VAR3 $
305 VAR4 $
306 VAR5 $
307 ;
308 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR
detection macro variable */
309 run;
NOTE: The infile 'C:\Temp\_TD4228\#LN00009' is:
File Name=C:\Temp\_TD4228\#LN00009,
RECFM=V,LRECL=32767
NOTE: 1200 records were read from the infile 'C:\Temp\_TD4228\#LN00009'.
The minimum record length was 31.
The maximum record length was 1230.
NOTE: The data set WORK.TESTREADIN has 1200 observations and 5
variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
What goes on here?
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Joe
Matise
Sent: Saturday, January 08, 2011 12:58 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Maximum Size for Character Variables?
Art, I *think* that windows registry fix applies to Excel (XLS,
specifically, the XLSX conversion is in another location) only. CSV
seems
to be an SCL program entirely, as far as I can tell (though the actual
mechanics is well beyond my expertise).
I don't find any unexpected performance, however, locally. I have a
basically fresh 9.2 TSM2 install on the PC I tested this on (no settings
changed from defaults whatsoever), I haven't fiddled with registries on
this
windows machine (again, fairly fresh install); only caveat is it's on
Win7
64, so who knows what that does. But this test program behaves as
expected,
I think; I have a single variable that has a size > 1024 characters, way
below the normal guessingrows value (omitting guessingrows ends with a
size
of $33), being read in as a much larger field when I increase
guessingrows
sufficiently to meet the need.
filename afile temp;
data null;
length x $1300;
file afile dlm=',' lrecl=5000;
do _n_ = 1 to 1200;
x="abc"||repeat('0',_n_)||'cba';
a="test";
b="hello";
y="goodbye";
z="end";
put
a $
b $
x $
y $
z $
;
end;
run;
data readin;
infile afile lrecl=5000 dlm=',';
length x $1300;
input
a $
b $
x $
y $
z $
;
run;
proc import file=afile dbms=csv out=readin_import replace;
guessingrows=1050;
getnames=no;
run;
Not only does the PROC IMPORT code get you the full length, but it goes
above and beyond and gets another couple hundred columns beyond what row
1050 needs.
I would be curious if you have any issues with the encoding of the file
(ANSI vs UTF-8, for example)?
-Joe
On Fri, Jan 7, 2011 at 7:09 PM, Arthur Tabachneck <art297@rogers.com>
wrote:
> Ted,
>
> You probably have to change Window's guessing rows setting. Take a
look at
> the following post from data_null: http://xrl.us/bfa8qi
>
> Art
> ------
> On Fri, 7 Jan 2011 20:02:17 -0500, Kirby, Ted <ted.kirby@LEWIN.COM>
wrote:
>
> >I have a rather messy CSV file (messy in the sense that the variable
> >values are all over the place). I was surprised at how well PROC
IMPORT
> >did in creating a SAS dataset, especially if I set the GUESSINGROWS
> >option high enough so that the variables have the correct format
> >(character or numeric) and size. My question is:
> >
> >
> >
> >Is there some sort of SAS option that restricts the size of character
> >variables that will be read with PROC IMPORT? I have scanned the
list
> >of SAS Options in the SAS documentation on my PC, but did not any
option
> >name that seemed to deal with variable lengths (other than the
MVARSIZE
> >option that deals with the size of macro variables).
> >
> >
> >
> >The reason for my question is that one instance of the character
> >variable "overall" contains 1,026 characters. This occurs on the
> >1,323rd row of the CSV file. When I use PROC IMPORT to read the data
> >and set GUESSINGROWS to 1,400 (to be sure SAS looks at the row with
the
> >longest instance of the "overall" variable), PROC IMPORT assigns a
> >length of 500 to the "overall" variable, not the correct length of
> >1,026.
> >
> >
> >
> >This question is somewhat academic because I have also read this file
> >with an INPUT statement and the colon format modifier as in:
> >
> >
> >
> >data client;
> >
> >infile "c:\clientfile.csv" dsd missover lrecl = 20000 firstobs = 2;
> >
> >input
> >
> >ClientID :4.
> >
> >fname :$22.
> >
> >
> >
> ><lots of other variables>
> >
> >
> >
> >Overall :$1026.
> >
> >
> >
> ><lots more variables>
> >
> >
> >
> >;
> >
> >run;
> >
> >
> >
> >but I just wanted to understand what was happening.
> >
> >
> >
> >
> >
> >
> >************* IMPORTANT - PLEASE READ ********************
> >
> >This e-mail, including attachments, may include confidential and/or
> proprietary information,
> >and may be used only by the person or entity to which it is
addressed. If
> the reader of this
> >e-mail is not the intended recipient or his or her authorized agent,
the
> reader is hereby
> >notified that any dissemination, distribution or copying of this
e-mail is
> prohibited. If you
> >have received this e-mail in error, please notify the sender by
replying
> to
> this message
> >and delete this e-mail immediately.
> >
>
************* IMPORTANT - PLEASE READ ********************
This e-mail, including attachments, may include confidential and/or proprietary information,
and may be used only by the person or entity to which it is addressed. If the reader of this
e-mail is not the intended recipient or his or her authorized agent, the reader is hereby
notified that any dissemination, distribution or copying of this e-mail is prohibited. If you
have received this e-mail in error, please notify the sender by replying to this message
and delete this e-mail immediately.