|
Mark,
It would make a nice beginning tutorial at a SOMESUG conference. I also like
the inobtrusive style. Why would not you consider presenting it. I
understand your SUG affiliation is both NE and SE, and both are still
soliciting presentations. Last but not least, you will have a chance to
match names with bodies.
Kind regards,
==================
Paul M. Dorfman
Jacksonville, Fl
==================
>From: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV>
>Reply-To: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV>
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: Re: Merge question - learning the Program Data Vector
>Date: Wed, 19 Apr 2000 12:13:48 -0700
>
> > Howard has done a good job in being cautious
> > with the potential of truncation. So for the benefit of
> > newer users of SAS and those less familiar with things
> > such as the compiler and PDV, and just for fun, and
> > through a little walk, the discussion of the SAS compiler
> > and the Program Data Vector (PDV) can give you
> > black-n-white knowledge and awareness of all the
> > possible attributes of behavior so that you can feel
> > secure in the way a datastep will behave each and
> > every time, as well as know how to use these methods
> > and make the appropriate adjustments to make them
> > work for you.
> >
> > For the example given below, and the topic at hand,
> > we will just be talking about "allocation" and the "size"
> > of the allocation.
> >
> > When you SUBMIT your datastep code to be run,
> > the SAS system parses your code text through a
> > compiler to 1) set things up, and 2) then executes,
> > the tasks you have written to be performed.
> >
> > The first thing is to "set things up". In SAS one of
> > the things it does to get prepared is to look through the
> > datastep code and find all the variables that will be used
> > or referred to throughout the datastep. These could be
> > variables you create "on the fly" such as X=0; without
> > a declaration, or could be variables you declare with a
> > statement such as LENGTH or ATTRIB, etc. Also, the
> > compiler goes and checks all the incoming datasets you
> > have referenced and finds all the variables that they
> > contain. You get the idea at this point that ANY variables
> > that are referenced directly or indirectly are initially
> > accounted for.
> >
> > Now, why it does that is for the benefit of setting up
> > a working bucket for each of them. In the SAS world
> > they call it the Program Data Vector or PDV. If we have
> > a sample datastep such as:
> >
> > data table2;
> > set table1;
> > run;
> >
> > where table1 contains variables A, B, and C, then when
> > the compiler sets up this datastep's PDV it will allocate a
> > bucket for each variable found such as:
> >
> > A B C
> > +-------+-------+-------+
> > | | | |
> > | | | |
> > +-------+-------+-------+
> >
> > The size of these work buckets will automatically match
> > the size and type found in the incoming dataset. Since the
> > SAS datastep is iterative (steps through and processes one
> > record at a time), then when it performs the SET statements,
> > the next record (observation) from the incoming TABLE1
> > dataset is fetched and the contents of the variables in this
> > observation only will be loaded into the PDV work buckets,
> > and since the size came from the incoming file, the contents
> > will be loaded intact and unchanged. Then at the end of the
> > datastep (you know, down where the RUN statement is), the
> > implied OUTPUT to TABLE2 will write the contents of the PDV
> > out as the next observation in TABLE2. Of course, between
> > the SET and RUN, you may have written code to alter the
> > contents of the PDV's copy of the variables.
> >
> > Now for something different. If we add a new variable to
> > our datastep, such as K, remember that all variables are found
> > and will have a spot created in the setup of the PDV. The
> > next rule to remember is that the PDV is created, with the
> > variables, in the order that the compiler finds them, as it looks
> > through the datastep from top to bottom, left to right. In this case:
> >
> > data table2;
> > set table1;
> > K = 1;
> > run;
> >
> > will create a PDV like:
> >
> > A B C K
> > +-------+-------+-------+-------+
> > | | | | |
> > | | | | |
> > +-------+-------+-------+-------+
> >
> >
> > If we had placed the line with K earlier or used a
> > declaration statement such as LENGTH earlier, then
> > the order that K would have been found would move:
> >
> > data table2;
> > length K 8;
> > set table1;
> > K = 1;
> > run;
> >
> > will create a PDV like:
> >
> > K A B C
> > +-------+-------+-------+-------+
> > | | | | |
> > | | | | |
> > +-------+-------+-------+-------+
> >
> >
> > The size at which the PDV is created is the first size
> > that is found, such as the predefined size it can find
> > on an incoming dataset or the LENGTH statement, etc.
> > If the first occurrence found does not have an associated
> > length then the SAS default will be applied. A length of
> > 8 for numeric variables and $8 for character variables.
> > The PDV length is sized to the *first occurrence* found.
> > Later sizes found do not affect the allocation size of the
> > PDV work buckets. (PS: work buckets -- is not a formal
> > name, it is just slang used here for discussion sake, and
> > the formal book terms, which will be different names for
> > the same thing in different programming languages, is
> > not important for our discussion at this time).
> >
> > Another aspect of character variable sizing is that it
> > only assigns $8 if no length is specified and that the
> > first data-contents loaded into it cannot determine the
> > first-found-size. For example, here we have a sample
> > of code that shows that the size is based on the first
> > occurrence only. Statements have been arranged to
> > force the creation of the variable without any possibility
> > of the first content being determined either, thus you
> > get $8:
> >
> > data test0;
> > a = '123456789012345';
> > run; * result of A with length $15 ;
> >
> > data test1;
> > input a $;
> > stop;
> > set test0;
> > run; * result of A with length $8 ;
> >
> > However, if the first content length for a character variable
> > was to be found, the PDV size will be that length:
> >
> > data test;
> > S = 'a';
> > run; * result of S with length $1 ;
> >
> > The maximum size of a numeric variable in SAS is 8.
> > The minimum size of a numeric variable in SAS for some
> > operating systems such as PC, Unix, etc. is 3, and some
> > operating systems such as MVS, CMS, VSE, etc. it is 2.
> > This is due to the native storage size for numbers in each
> > operating system and how SAS has implemented them.
> >
> > Two handy tables for you:
> >
> >
> > Significant Digits and Largest Integer by Length: CMS, OS,
> > VM, MVS, VSE
> >
> > Length Significant Digits Retained Largest Integer
>Represented
> > Exactly
> > 2 2 255
> > 3 4 65,535
> > 4 7 16,777,215
> > 5 9 4,294,967,295
> > 6 12 1,099,511,627,775
> > 7 14 281,474,946,710,655
> > 8 16 72,057,594,037,927,935
> >
> >
> >
> > Storage length for Integers: PC, Unix, etc.
> >
> > Length Significant Digits Retained Largest Integer
>Represented
> > Exactly
> > 3 3 8,192
> > 4 6 2,097,152
> > 5 8 536,870,152
> > 6 11 137,438,953,472
> > 7 13 35,184,372,088,832
> > 8 15 9,007,199,254,740,992
> >
> >
> >
> > Wow, we have a lot of stuff now. So what does this all mean?
> >
> > Let's look at one more tidbit about the allocation of the PDV.
> > Not only is the size based upon the first one found, the allocation
> > of each bucket is based upon the first copy found. This means
> > that if the programmer declares/creates a variable that is what
> > the PDV creates a work bucket with. If an incoming dataset also
> > has a variable of the same name, the compiler setting up the
> > PDV sees that one is already there and does not create a
> > second work bucket, it says there is already one there to use.
> >
> > Let's go back to the original sample given below and tie all
> > of this stuff together.
> >
> > Data C;
> > Merge A B;
> > By X;
> > Run;
> >
> > The original dataset A contained two variables X and Y,
> > both of which are defaulted to numeric length 8. The original
> > dataset B contained two variables X and Z, both of which are
> > defaulted to numeric length 8.
> >
> > In the MERGE datastep, the scanning of the compiler will
> > look for the first occurrences of all variables and create a PDV
> > finding dataset A referenced first (top to bottom, left to right)
> > thus starting a PDV finding a numeric variable length 8 called
> > X and then a numeric length 8 called Y:
> >
> > X Y
> > +-------+-------+
> > | | |
> > | | |
> > +-------+-------+
> >
> > Then it finds dataset B with a numeric length 8 called X,
> > and at this time the compiler sees that it already has a
> > work bucket called X, so it skips on to the next to be
> > found which is a numeric length 8 called Z, and since
> > it doesn't find one in the PDV it creates another work
> > bucket for Z and adds it to the PDV:
> >
> >
> > X Y Z
> > +-------+-------+-------+
> > | | | |
> > | | | |
> > +-------+-------+-------+
> >
> >
> > Another goodie you'll need to know is that each work
> > bucket created for dataset variables internally retains
> > the contents (at different times) across to the next
> > iteration(observation), such as concatenating with the
> > SET statement (e.g. SET TABLE1 TABLE2; ) the
> > dataset variables in the PDV will be reset at each new
> > dataset. For interleaving (SET and BY statements) the
> > dataset variables will be reset at each new dataset and
> > at each new BY group. For MERGE with BY, the dataset
> > variables are reset at each BY group. These are some
> > of the rules denoting where resetting takes place while
> > other situations the dataset variables get retained. You
> > can check your SAS Language manual for the complete
> > particulars. I state just enough here to cover a couple
> > of the combinations in the discussion further on.
> >
> > User created variables are not retained automatically.
> > The user can specify which user created variables to be
> > retained by listing them with the RETAIN statement.
> >
> > Thus, the original posted behavior had the dataset
> > variable work buckets keep (duplicate) contents across
> > records with duplicate BY values. John's question was
> > to stop this retention of content within the BY value group
> > and only have values when there physically was a record
> > in the merged file to be loaded into the PDV work bucket.
> >
> > At this point we will see that the consideration we need
> > to be aware of (truncation) for numeric and character
> > variables (in one aspect) are inverted from each other.
> > (i.e. numeric defaults to the biggest, character to the smallest)
> > For the most part, character lengths need to be cared for
> > in the aspect of truncation since the allocation size of the
> > variable in the PDV may be of a smaller size than later
> > values to be loaded into it. Example: if a character variable
> > in the PDV gets sized to $3 and the values being loaded in
> > can be larger, then the larger values are loaded into a work
> > bucket of length 3 and thus get truncated.
> >
> > An example for the character length truncation:
> >
> > data test;
> > S = 'a'; output;
> > S = 'def'; output;
> > run; * result of S with length $1 ;
> > * and record 1 containing 'a' and ;
> > * the record 2 containing 'd' ;
> >
> >
> > The numeric default for an undeclared numeric length will
> > be 8 which is the maximum size, so even if the numeric lengths
> > being loaded in are smaller, the value will always fit. This is why
> > in the case of numerics it is possible to always be sure of the
> > expected results when a shortcut trick like below is used such
> > as the Z=.; (providing that it gets allocated to the max size and
> > not declared smaller than the incoming dataset size).
> >
> >
> > With confidence we can make this work for ourselves,
> > knowing that in the modified example:
> >
> > Data C;
> > z = .;
> > Merge A B;
> > By X;
> > Run;
> >
> > the Program Data Vector (PDV) created, finds the user's
> > copy of Z first and thus allocates a PDV like:
> >
> > Z X Y
> > +-------+-------+-------+
> > | | | |
> > | | | |
> > +-------+-------+-------+
> >
> > and that on the beginning of each iteration we assign
> > a missing value to the Z work bucket, and then the next
> > statement executes and reads (via the MERGE) the records
> > from both datasets A and B (where found, according to the
> > BY statement) and loads values for X and Y and Z if they
> > were found. If a Z value was really found and loaded, it
> > would overwrite our assignment just previously done.
> > If there wasn't a Z value found, our previously assigned
> > value of missing would remain intact. Thus the behavior
> > that John's initial request was looking for, also knowing that
> > truncation won't occur in the numeric setting, since it was
> > being loaded into a PDV bucket of maximum size for the
> > numeric.
> >
> > For character values, yes, you can also have truncation,
> > if you don't pay attention to what PDV sizes and what sizes
> > are coming in and going out, etc. But for character
> > situations you still can orchestrate and be confident in the
> > expected results if you stay aware of these rules and you
> > also *know* your data and the incoming dataset variable
> > sizes, etc. It is possible to use the LENGTH statement for
> > example, on the user's pre-initialized variable, to make sure
> > the work bucket created in the PDV is of adequate size to
> > avoid any data content corruption due to truncation.
> >
> > So always take Howard's advice and be careful, but with
> > thought and careful application of a number of aspects of
> > SAS rules and behavior, you can make both work for you,
> > if you can arrange things to occur in your desired order.
> >
> > Wouldn't it be way-cool if we could fetch an array of the
> > PDV's current attributes (i.e. length, type, retain flag, etc.).
> >
> > Hope this is helpful, and I know many parts of this will
> > be helpful in a great number of other situations you find,
> > Mark Terjeson
> > Washington State Department of Social and Health Services
> > Division of Research and Data Analysis (RDA)
> > (360) 902-0741
> > (360) 902-0705 fax
> > mailto:terjemw@dshs.wa.gov
> >
> >
> >
> >
> >
> > -
> > -----Original Message-----
> > From: Howard Schreier [SMTP:Howard_Schreier@ITA.DOC.GOV]
> > Sent: Tuesday, April 18, 2000 8:36 AM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Re: Merge question
> >
> > John's original code generates this note on the log:
> >
> > MERGE statement has more than one data set with repeats of BY values.
> >
> > In my experience that usually points to a problem with at least the
> > potential for generating unwanted results.
> >
> > Mark's idea (initializing Z to missing via an assignment statement
> > before the MERGE statement) solves the problem in this case. If it's
> > possible that for some BY groups B rather than A has "extra"
> > observations, and assuming that symmetric treatment is desired, add
> >
> > y = .;
> >
> > This method could get clumsy if there are a lot of variables to
> > initialize. It also works less well for character variables, where it's
> > too easy to unintentionally alter lengths and perhaps truncate data
> > values. To illustrate, I reworked the example, making Y and Z character
> > rather than numeric.
> >
> > Data A;
> > input X Y $;
> > cards;
> > 1 1a
> > 1 2a
> > 1 3a
> > 1 4a
> > 2 1b
> > 2 2b
> > 2 3b
> > ;
> > Data B;
> > input X Z $;
> > cards;
> > 1 1xxx
> > 1 1yyy
> > 1 2zzz
> > 2 1www
> > ;
> >
> > Now try:
> >
> > Data C;
> > z= ' ';
> > Merge A B;
> > By X;
> > put x y z;
> > Run;
> >
> > All values of Z get truncated to length=1 because that's the length of
> > the constant in the assignment statement. One can attend to that
> > explicitly, but it can get to be a lot of trouble. Here's a rather
> > bizarre DATA step which takes care of the problem implicitly.
> >
> > Data C;
> > if 2+2=5 then merge a b;
> > z= ' ';
> > Merge A B;
> > By X;
> > put x y z;
> > Run;
> >
> > The first MERGE is never executed and is present only for compile-time
> > effects, in particular the determination of lengths.
> >
> > However, my recommendation is to get away from the problem of repeats
> > within BY groups altogether:
> >
> > data a2merge;
> > set a; by x;
> > if first.x then __num = 0;
> > __num + +1;
> > run;
> >
> > data b2merge;
> > set b; by x;
> > if first.x then __num = 0;
> > __num + +1;
> > run;
> >
> >
> > Data C;
> > Merge A2merge B2merge;
> > By X __num;
> > drop __num;
> > Run;
> >
> > > From: "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV>
> > >
> > > Hi John,
> > >
> > > ...interesting-row-count-match...
> > >
> > > This seems to do what you desire:
> > > (make sure you initialize before the merge statement)
> > >
> > > Data C;
> > > z = .;
> > > Merge A B;
> > > By X;
> > > Run;
> > >
> > > Hope this is helpful,
> > > Mark Terjeson
> > > Washington State Department of Social and Health
> > > Services
> > > Division of Research and Data Analysis (RDA)
> > > (360) 902-0741
> > > (360) 902-0705 fax
> > > mailto:terjemw@dshs.wa.gov
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: abcx888@MY-DEJA.COM [SMTP:abcx888@MY-DEJA.COM]
> > > >
> > > > Hi,
> > > >
> > > > I have two data sets A and B need to be match-merged.
> > > >
> > > > Dataset A: (two variables X and Y)
> > > >
> > > > X Y
> > > > 1 1
> > > > 1 2
> > > > 1 3
> > > > 1 4
> > > > 2 1
> > > > 2 2
> > > > 2 3
> > > >
> > > > Dataset B: (two variables X and Z)
> > > >
> > > > X Z
> > > > 1 1
> > > > 1 1
> > > > 1 2
> > > > 2 1
> > > >
> > > > By using following SAS code, I got the merged dataset
> > > C.
> > > >
> > > > Data C;
> > > > Merge A B;
> > > > By X;
> > > > Run;
> > > >
> > > > Dataset C: (three variables)
> > > >
> > > > X Y Z
> > > > 1 1 1
> > > > 1 2 1
> > > > 1 3 2
> > > > 1 4 2 <--
> > > > 2 1 1
> > > > 2 2 1 <--
> > > > 2 3 1 <--
> > > >
> > > > Is there a way to match merge above A and B in such way that the
>value
> > > > of Z for those extra observations in each BY group is missing
>instead
> > of
> > > > inheriting the value from previous observation? The idea result of
> > > > Dataset C is
> > > >
> > > > X Y Z
> > > > 1 1 1
> > > > 1 2 1
> > > > 1 3 2
> > > > 1 4 .
> > > > 2 1 1
> > > > 2 2 .
> > > > 2 3 .
> > > >
> > > >
> > > > Thanks
> > > >
> > > > John Wang
______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com
|