LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2000, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 20 Apr 2000 21:38:32 +1200
Reply-To:   Don Stanley <don_stanley@XTRA.CO.NZ>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Don Stanley <don_stanley@XTRA.CO.NZ>
Organization:   Xtra
Subject:   Re: Merge question - learning the Program Data Vector
Content-Type:   text/plain; charset=us-ascii

Just one point that you might like to clarify, as I don't think you explicitly state this. On MVS at least, when you load a numeric variable from a dataset to the PDV the PDV uses 8 bytes. Even if you store the variable in length 2, the PDV is still 8 bytes. The length of two is applied only when the dataset is written out. Thus, you can have a variable of length 2, assign it a value of say 65362345, and in the step you created it, that value is available in the PDV. Now start another datastep, load the dataset with the 2 byte variable back in, and you have a value that is different, because the 2 byte representation could not store the whole value when it was written out.

I just checked in Windows and it would appear 8 bytes are used in the PDV there as well. Here is an example::

31 data test ; 32 length x 3 ; 33 x = 81929 ; 34 put x= x=hex16. ; 35 run ;

X=81929 X=40F4009000000000 NOTE: The data set WORK.TEST has 1 observations and 1 variables. NOTE: The DATA statement used 0.22 seconds.

36 37 data test1 ; 38 set test ; 39 put x= x= hex16.; 40 run ;

X=81920 X=40F4000000000000 NOTE: The data set WORK.TEST1 has 1 observations and 1 variables. NOTE: The DATA statement used 0.05 seconds.

See how the first datastep uses the 4th byte as part of the value, but the second datastep, whilst it uses the 8 bytes in the PDV, has no value in the 4th byte because that precision disappeared when the dataset was saved earlier.

Don

Paul Dorfman wrote:

> Mark, > > It would make a nice beginning tutorial at a SOMESUG conference. I also like > the inobtrusive style. Why would not you consider presenting it. I > understand your SUG affiliation is both NE and SE, and both are still > soliciting presentations. Last but not least, you will have a chance to > match names with bodies. > > Kind regards, > ================== > Paul M. Dorfman > Jacksonville, Fl > ================== > > >From: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV> > >Reply-To: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV> > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: Re: Merge question - learning the Program Data Vector > >Date: Wed, 19 Apr 2000 12:13:48 -0700 > > > > > Howard has done a good job in being cautious > > > with the potential of truncation. So for the benefit of > > > newer users of SAS and those less familiar with things > > > such as the compiler and PDV, and just for fun, and > > > through a little walk, the discussion of the SAS compiler > > > and the Program Data Vector (PDV) can give you > > > black-n-white knowledge and awareness of all the > > > possible attributes of behavior so that you can feel > > > secure in the way a datastep will behave each and > > > every time, as well as know how to use these methods > > > and make the appropriate adjustments to make them > > > work for you. > > > > > > For the example given below, and the topic at hand, > > > we will just be talking about "allocation" and the "size" > > > of the allocation. > > > > > > When you SUBMIT your datastep code to be run, > > > the SAS system parses your code text through a > > > compiler to 1) set things up, and 2) then executes, > > > the tasks you have written to be performed. > > > > > > The first thing is to "set things up". In SAS one of > > > the things it does to get prepared is to look through the > > > datastep code and find all the variables that will be used > > > or referred to throughout the datastep. These could be > > > variables you create "on the fly" such as X=0; without > > > a declaration, or could be variables you declare with a > > > statement such as LENGTH or ATTRIB, etc. Also, the > > > compiler goes and checks all the incoming datasets you > > > have referenced and finds all the variables that they > > > contain. You get the idea at this point that ANY variables > > > that are referenced directly or indirectly are initially > > > accounted for. > > > > > > Now, why it does that is for the benefit of setting up > > > a working bucket for each of them. In the SAS world > > > they call it the Program Data Vector or PDV. If we have > > > a sample datastep such as: > > > > > > data table2; > > > set table1; > > > run; > > > > > > where table1 contains variables A, B, and C, then when > > > the compiler sets up this datastep's PDV it will allocate a > > > bucket for each variable found such as: > > > > > > A B C > > > +-------+-------+-------+ > > > | | | | > > > | | | | > > > +-------+-------+-------+ > > > > > > The size of these work buckets will automatically match > > > the size and type found in the incoming dataset. Since the > > > SAS datastep is iterative (steps through and processes one > > > record at a time), then when it performs the SET statements, > > > the next record (observation) from the incoming TABLE1 > > > dataset is fetched and the contents of the variables in this > > > observation only will be loaded into the PDV work buckets, > > > and since the size came from the incoming file, the contents > > > will be loaded intact and unchanged. Then at the end of the > > > datastep (you know, down where the RUN statement is), the > > > implied OUTPUT to TABLE2 will write the contents of the PDV > > > out as the next observation in TABLE2. Of course, between > > > the SET and RUN, you may have written code to alter the > > > contents of the PDV's copy of the variables. > > > > > > Now for something different. If we add a new variable to > > > our datastep, such as K, remember that all variables are found > > > and will have a spot created in the setup of the PDV. The > > > next rule to remember is that the PDV is created, with the > > > variables, in the order that the compiler finds them, as it looks > > > through the datastep from top to bottom, left to right. In this case: > > > > > > data table2; > > > set table1; > > > K = 1; > > > run; > > > > > > will create a PDV like: > > > > > > A B C K > > > +-------+-------+-------+-------+ > > > | | | | | > > > | | | | | > > > +-------+-------+-------+-------+ > > > > > > > > > If we had placed the line with K earlier or used a > > > declaration statement such as LENGTH earlier, then > > > the order that K would have been found would move: > > > > > > data table2; > > > length K 8; > > > set table1; > > > K = 1; > > > run; > > > > > > will create a PDV like: > > > > > > K A B C > > > +-------+-------+-------+-------+ > > > | | | | | > > > | | | | | > > > +-------+-------+-------+-------+ > > > > > > > > > The size at which the PDV is created is the first size > > > that is found, such as the predefined size it can find > > > on an incoming dataset or the LENGTH statement, etc. > > > If the first occurrence found does not have an associated > > > length then the SAS default will be applied. A length of > > > 8 for numeric variables and $8 for character variables. > > > The PDV length is sized to the *first occurrence* found. > > > Later sizes found do not affect the allocation size of the > > > PDV work buckets. (PS: work buckets -- is not a formal > > > name, it is just slang used here for discussion sake, and > > > the formal book terms, which will be different names for > > > the same thing in different programming languages, is > > > not important for our discussion at this time). > > > > > > Another aspect of character variable sizing is that it > > > only assigns $8 if no length is specified and that the > > > first data-contents loaded into it cannot determine the > > > first-found-size. For example, here we have a sample > > > of code that shows that the size is based on the first > > > occurrence only. Statements have been arranged to > > > force the creation of the variable without any possibility > > > of the first content being determined either, thus you > > > get $8: > > > > > > data test0; > > > a = '123456789012345'; > > > run; * result of A with length $15 ; > > > > > > data test1; > > > input a $; > > > stop; > > > set test0; > > > run; * result of A with length $8 ; > > > > > > However, if the first content length for a character variable > > > was to be found, the PDV size will be that length: > > > > > > data test; > > > S = 'a'; > > > run; * result of S with length $1 ; > > > > > > The maximum size of a numeric variable in SAS is 8. > > > The minimum size of a numeric variable in SAS for some > > > operating systems such as PC, Unix, etc. is 3, and some > > > operating systems such as MVS, CMS, VSE, etc. it is 2. > > > This is due to the native storage size for numbers in each > > > operating system and how SAS has implemented them. > > > > > > Two handy tables for you: > > > > > > > > > Significant Digits and Largest Integer by Length: CMS, OS, > > > VM, MVS, VSE > > > > > > Length Significant Digits Retained Largest Integer > >Represented > > > Exactly > > > 2 2 255 > > > 3 4 65,535 > > > 4 7 16,777,215 > > > 5 9 4,294,967,295 > > > 6 12 1,099,511,627,775 > > > 7 14 281,474,946,710,655 > > > 8 16 72,057,594,037,927,935 > > > > > > > > > > > > Storage length for Integers: PC, Unix, etc. > > > > > > Length Significant Digits Retained Largest Integer > >Represented > > > Exactly > > > 3 3 8,192 > > > 4 6 2,097,152 > > > 5 8 536,870,152 > > > 6 11 137,438,953,472 > > > 7 13 35,184,372,088,832 > > > 8 15 9,007,199,254,740,992 > > > > > > > > > > > > Wow, we have a lot of stuff now. So what does this all mean? > > > > > > Let's look at one more tidbit about the allocation of the PDV. > > > Not only is the size based upon the first one found, the allocation > > > of each bucket is based upon the first copy found. This means > > > that if the programmer declares/creates a variable that is what > > > the PDV creates a work bucket with. If an incoming dataset also > > > has a variable of the same name, the compiler setting up the > > > PDV sees that one is already there and does not create a > > > second work bucket, it says there is already one there to use. > > > > > > Let's go back to the original sample given below and tie all > > > of this stuff together. > > > > > > Data C; > > > Merge A B; > > > By X; > > > Run; > > > > > > The original dataset A contained two variables X and Y, > > > both of which are defaulted to numeric length 8. The original > > > dataset B contained two variables X and Z, both of which are > > > defaulted to numeric length 8. > > > > > > In the MERGE datastep, the scanning of the compiler will > > > look for the first occurrences of all variables and create a PDV > > > finding dataset A referenced first (top to bottom, left to right) > > > thus starting a PDV finding a numeric variable length 8 called > > > X and then a numeric length 8 called Y: > > > > > > X Y > > > +-------+-------+ > > > | | | > > > | | | > > > +-------+-------+ > > > > > > Then it finds dataset B with a numeric length 8 called X, > > > and at this time the compiler sees that it already has a > > > work bucket called X, so it skips on to the next to be > > > found which is a numeric length 8 called Z, and since > > > it doesn't find one in the PDV it creates another work > > > bucket for Z and adds it to the PDV: > > > > > > > > > X Y Z > > > +-------+-------+-------+ > > > | | | | > > > | | | | > > > +-------+-------+-------+ > > > > > > > > > Another goodie you'll need to know is that each work > > > bucket created for dataset variables internally retains > > > the contents (at different times) across to the next > > > iteration(observation), such as concatenating with the > > > SET statement (e.g. SET TABLE1 TABLE2; ) the > > > dataset variables in the PDV will be reset at each new > > > dataset. For interleaving (SET and BY statements) the > > > dataset variables will be reset at each new dataset and > > > at each new BY group. For MERGE with BY, the dataset > > > variables are reset at each BY group. These are some > > > of the rules denoting where resetting takes place while > > > other situations the dataset variables get retained. You > > > can check your SAS Language manual for the complete > > > particulars. I state just enough here to cover a couple > > > of the combinations in the discussion further on. > > > > > > User created variables are not retained automatically. > > > The user can specify which user created variables to be > > > retained by listing them with the RETAIN statement. > > > > > > Thus, the original posted behavior had the dataset > > > variable work buckets keep (duplicate) contents across > > > records with duplicate BY values. John's question was > > > to stop this retention of content within the BY value group > > > and only have values when there physically was a record > > > in the merged file to be loaded into the PDV work bucket. > > > > > > At this point we will see that the consideration we need > > > to be aware of (truncation) for numeric and character > > > variables (in one aspect) are inverted from each other. > > > (i.e. numeric defaults to the biggest, character to the smallest) > > > For the most part, character lengths need to be cared for > > > in the aspect of truncation since the allocation size of the > > > variable in the PDV may be of a smaller size than later > > > values to be loaded into it. Example: if a character variable > > > in the PDV gets sized to $3 and the values being loaded in > > > can be larger, then the larger values are loaded into a work > > > bucket of length 3 and thus get truncated. > > > > > > An example for the character length truncation: > > > > > > data test; > > > S = 'a'; output; > > > S = 'def'; output; > > > run; * result of S with length $1 ; > > > * and record 1 containing 'a' and ; > > > * the record 2 containing 'd' ; > > > > > > > > > The numeric default for an undeclared numeric length will > > > be 8 which is the maximum size, so even if the numeric lengths > > > being loaded in are smaller, the value will always fit. This is why > > > in the case of numerics it is possible to always be sure of the > > > expected results when a shortcut trick like below is used such > > > as the Z=.; (providing that it gets allocated to the max size and > > > not declared smaller than the incoming dataset size). > > > > > > > > > With confidence we can make this work for ourselves, > > > knowing that in the modified example: > > > > > > Data C; > > > z = .; > > > Merge A B; > > > By X; > > > Run; > > > > > > the Program Data Vector (PDV) created, finds the user's > > > copy of Z first and thus allocates a PDV like: > > > > > > Z X Y > > > +-------+-------+-------+ > > > | | | | > > > | | | | > > > +-------+-------+-------+ > > > > > > and that on the beginning of each iteration we assign > > > a missing value to the Z work bucket, and then the next > > > statement executes and reads (via the MERGE) the records > > > from both datasets A and B (where found, according to the > > > BY statement) and loads values for X and Y and Z if they > > > were found. If a Z value was really found and loaded, it > > > would overwrite our assignment just previously done. > > > If there wasn't a Z value found, our previously assigned > > > value of missing would remain intact. Thus the behavior > > > that John's initial request was looking for, also knowing that > > > truncation won't occur in the numeric setting, since it was > > > being loaded into a PDV bucket of maximum size for the > > > numeric. > > > > > > For character values, yes, you can also have truncation, > > > if you don't pay attention to what PDV sizes and what sizes > > > are coming in and going out, etc. But for character > > > situations you still can orchestrate and be confident in the > > > expected results if you stay aware of these rules and you > > > also *know* your data and the incoming dataset variable > > > sizes, etc. It is possible to use the LENGTH statement for > > > example, on the user's pre-initialized variable, to make sure > > > the work bucket created in the PDV is of adequate size to > > > avoid any data content corruption due to truncation. > > > > > > So always take Howard's advice and be careful, but with > > > thought and careful application of a number of aspects of > > > SAS rules and behavior, you can make both work for you, > > > if you can arrange things to occur in your desired order. > > > > > > Wouldn't it be way-cool if we could fetch an array of the > > > PDV's current attributes (i.e. length, type, retain flag, etc.). > > > > > > Hope this is helpful, and I know many parts of this will > > > be helpful in a great number of other situations you find, > > > Mark Terjeson > > > Washington State Department of Social and Health Services > > > Division of Research and Data Analysis (RDA) > > > (360) 902-0741 > > > (360) 902-0705 fax > > > mailto:terjemw@dshs.wa.gov > > > > > > > > > > > > > > > > > > - > > > -----Original Message----- > > > From: Howard Schreier [SMTP:Howard_Schreier@ITA.DOC.GOV] > > > Sent: Tuesday, April 18, 2000 8:36 AM > > > To: SAS-L@LISTSERV.UGA.EDU > > > Subject: Re: Merge question > > > > > > John's original code generates this note on the log: > > > > > > MERGE statement has more than one data set with repeats of BY values. > > > > > > In my experience that usually points to a problem with at least the > > > potential for generating unwanted results. > > > > > > Mark's idea (initializing Z to missing via an assignment statement > > > before the MERGE statement) solves the problem in this case. If it's > > > possible that for some BY groups B rather than A has "extra" > > > observations, and assuming that symmetric treatment is desired, add > > > > > > y = .; > > > > > > This method could get clumsy if there are a lot of variables to > > > initialize. It also works less well for character variables, where it's > > > too easy to unintentionally alter lengths and perhaps truncate data > > > values. To illustrate, I reworked the example, making Y and Z character > > > rather than numeric. > > > > > > Data A; > > > input X Y $; > > > cards; > > > 1 1a > > > 1 2a > > > 1 3a > > > 1 4a > > > 2 1b > > > 2 2b > > > 2 3b > > > ; > > > Data B; > > > input X Z $; > > > cards; > > > 1 1xxx > > > 1 1yyy > > > 1 2zzz > > > 2 1www > > > ; > > > > > > Now try: > > > > > > Data C; > > > z= ' '; > > > Merge A B; > > > By X; > > > put x y z; > > > Run; > > > > > > All values of Z get truncated to length=1 because that's the length of > > > the constant in the assignment statement. One can attend to that > > > explicitly, but it can get to be a lot of trouble. Here's a rather > > > bizarre DATA step which takes care of the problem implicitly. > > > > > > Data C; > > > if 2+2=5 then merge a b; > > > z= ' '; > > > Merge A B; > > > By X; > > > put x y z; > > > Run; > > > > > > The first MERGE is never executed and is present only for compile-time > > > effects, in particular the determination of lengths. > > > > > > However, my recommendation is to get away from the problem of repeats > > > within BY groups altogether: > > > > > > data a2merge; > > > set a; by x; > > > if first.x then __num = 0; > > > __num + +1; > > > run; > > > > > > data b2merge; > > > set b; by x; > > > if first.x then __num = 0; > > > __num + +1; > > > run; > > > > > > > > > Data C; > > > Merge A2merge B2merge; > > > By X __num; > > > drop __num; > > > Run; > > > > > > > From: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV> > > > > > > > > Hi John, > > > > > > > > ...interesting-row-count-match... > > > > > > > > This seems to do what you desire: > > > > (make sure you initialize before the merge statement) > > > > > > > > Data C; > > > > z = .; > > > > Merge A B; > > > > By X; > > > > Run; > > > > > > > > Hope this is helpful, > > > > Mark Terjeson > > > > Washington State Department of Social and Health > > > > Services > > > > Division of Research and Data Analysis (RDA) > > > > (360) 902-0741 > > > > (360) 902-0705 fax > > > > mailto:terjemw@dshs.wa.gov > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: abcx888@MY-DEJA.COM [SMTP:abcx888@MY-DEJA.COM] > > > > > > > > > > Hi, > > > > > > > > > > I have two data sets A and B need to be match-merged. > > > > > > > > > > Dataset A: (two variables X and Y) > > > > > > > > > > X Y > > > > > 1 1 > > > > > 1 2 > > > > > 1 3 > > > > > 1 4 > > > > > 2 1 > > > > > 2 2 > > > > > 2 3 > > > > > > > > > > Dataset B: (two variables X and Z) > > > > > > > > > > X Z > > > > > 1 1 > > > > > 1 1 > > > > > 1 2 > > > > > 2 1 > > > > > > > > > > By using following SAS code, I got the merged dataset > > > > C. > > > > > > > > > > Data C; > > > > > Merge A B; > > > > > By X; > > > > > Run; > > > > > > > > > > Dataset C: (three variables) > > > > > > > > > > X Y Z > > > > > 1 1 1 > > > > > 1 2 1 > > > > > 1 3 2 > > > > > 1 4 2 <-- > > > > > 2 1 1 > > > > > 2 2 1 <-- > > > > > 2 3 1 <-- > > > > > > > > > > Is there a way to match merge above A and B in such way that the > >value > > > > > of Z for those extra observations in each BY group is missing > >instead > > > of > > > > > inheriting the value from previous observation? The idea result of > > > > > Dataset C is > > > > > > > > > > X Y Z > > > > > 1 1 1 > > > > > 1 2 1 > > > > > 1 3 2 > > > > > 1 4 . > > > > > 2 1 1 > > > > > 2 2 . > > > > > 2 3 . > > > > > > > > > > > > > > > Thanks > > > > > > > > > > John Wang > > ______________________________________________________ > Get Your Private, Free Email at http://www.hotmail.com

-- Don Stanley, B.SC, Dip O.R.S, MNZCS Director, Sysware Consulting Group Box 634, Wellington, NEW ZEALAND http://www.sysware.co.nz EMAIL:: don_stanley@xtra.co.nz http://www.geocities.com/don_stanley_nz/don_home.htm Genealogy:: http://www.geocities.com/don_stanley_nz/family.htm


Back to: Top of message | Previous page | Main SAS-L page