Date: Thu, 26 Apr 2007 14:42:03 -0400
Reply-To: Chang Chung <chang_y_chung@HOTMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Chang Chung <chang_y_chung@HOTMAIL.COM>
Subject: Re: How to do a reverse filling (missing observations)?
On Thu, 26 Apr 2007 17:37:02 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
>Hmmm that seems to me like processing the data three time which is atleast 1
>times too many.
>
>>From: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
>>A small change of Barry's code and the use of double do-loops gives a clean
>>output. The FIRST do-loop carries the YEAR from the last record of the
>>group(cnt) and enters into the SECOND do-loop and propagates the YEAR for
>>each record of the group(cnt). Again the control goes to top of the data
>>step to process the records of the next group and so on.
>>
>>Muthia Kachirayan
>>
>>data a;
>> infile datalines;
>> input @1 prodid 2. @4 custid 2. @7 year 4.;
>> retain cnt 1;
>> output;
>> if year > . then cnt + 1;
>>datalines;
>>1 1
>>2 1
>>3 1 1998
>>2 1
>>3 1
>>4 1 1999
>>4 2
>>5 2
>>6 2 1998
>>;
>>run;
>>
>>data want(drop = cnt);
>> do until (last.cnt);
>> set a;
>> by cnt;
>> end;
>> do until (last.cnt);
>> set a(drop = year);
>> by cnt;
>> output;
>> end;
>>run;
Hi,
Muthia's code is cute. I like the clean look of the second data step, which
utilizes a variant of the "Double DoW" technique. The classic "Double Dow"
usually calculates some kinds of by-group summary stats in the first DoW
loop and then attaches the stats to each and every observation in the
by-group using the second DoW loop.
Here we don't need to calculate the summary stats since the "summary stats"
are just the last (within the by-group) observation values! Now, you might
say, "Wait a minute! We only want to attach/replace the year value and
nothing else!" Well, don't worry. The second set statement over-writes
everything with the original values, except the year variable because it is
dropped in the second loop's set statement option. Overall it is quite
elegant piece of code!
Practically though, I am bothered by the fact that the input dataset does
not have a record id -- especially since it is obvious that the order of
observations means something. I would rather create a record id first,
unless it takes too much in terms of either/both time and storage.
If the dataset is small enough, then we cannot do better than just run the
whole data step backward only once updating in-place. Well, I don't find
myself updating a dataset in-place often -- it is too risky. So at least to
me, this remains as an exercise -- it is good to know that it is possible,
just in case I need it someday.
Cheers,
Chang
data a;
infile datalines missover;
input @1 prodid 2. @4 custid 2. @7 year 4.;
datalines;
1 1
2 1
3 1 1998
2 1
3 1
4 1 1999
4 2
5 2
6 2 1998
;
run;
data a;
do i=nobs to 1 by -1;
modify a nobs=nobs point=i;
if missing(year) then do;
year = sYear;
replace;
end; else
sYear = year;
end;
stop;
run;
proc print data=a;
run;
/* on lst
Obs prodid custid year
1 1 1 1998
2 2 1 1998
3 3 1 1998
4 2 1 1999
5 3 1 1999
6 4 1 1999
7 4 2 1998
8 5 2 1998
9 6 2 1998
*/