Date: Tue, 21 Jun 2011 11:32:04 -0400
Reply-To: Bolotin Yevgeniy <YBolotin@SCHOOLS.NYC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Bolotin Yevgeniy <YBolotin@SCHOOLS.NYC.GOV>
Subject: Re: finding maximum of multiple variables over multiple date
windows
In-Reply-To: <DCCF5C4C413228459F79656548489F6D07CB5B2527@EX3VS1.nyced.org>
Content-Type: text/plain; charset="us-ascii"
Slight simplification: for purposes of this calculation, missing values
can be treated as '0' (i.e. we only care about the '1' outcomes)
-----Original Message-----
From: Bolotin Yevgeniy
Sent: Tuesday, June 21, 2011 11:30 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: finding maximum of multiple variables over multiple date
windows
my google-fu is insufficient to find anything, although I'm reasonably
sure it's been done before...
I have a reasonably large (20 million-ish records) dataset with about a
dozen variables I need to aggregate (sample data at the end of the
email. each variable is trinary: character '0', '1', or missing)
For each of these variables, I need to find its maximum value in the
last 2 years, last 3 years, ... etc up to 8, for a given set of years
Requirements:
* should be easy to read/understand/modify (primary requirement) and,
* if at all possible, should be reasonably fast
(i have a variant which does this with a few macro loops and a bunch of
PROC SQLs, but this takes about 3 hours to run - it should theoretically
be doable much faster by doing them all in one pass, rather than one at
a time like it's being done now...)
I tried writing a single-datastep version with macro logic, but the
result consisted almost entirely of macro variables and practically no
code, and was completely unreadable.
Sample input data:
data HAVE; *note: dates may not be continuous - e.g. id 1 is missing
year 2007;
input
id year a :$1. b :$1. c :$1. d :$1. e :$1. f :$1.;
datalines;
1 2006 1 0 0 1 1 1
1 2008 1 0 0 0 1 1
1 2009 0 0 . 0 1 1
1 2010 0 0 0 0 1 1
1 2011 0 0 0 0 1 0
1 2012 0 0 0 0 1 0
2 2008 0 1 1 0 1 1
2 2009 1 1 1 0 0 0
;
Run;
/*
Output should have something like the following variables:
id year max_of_last_2_years_a max_of_last_3_years_a
max_of_last_4_years_a ... max_of_last_8_years_a max_of_last_2_years_b
... max_of_last_8_years_f;
with the following values for e.g. max_of_last_3_years_a:
id year max_of_last_3_years_a comments
1 2006 1 value was 1 in the last 3 years inclusive (in
this case, just in 2006)
1 2008 1 value was 1 in the last 3 years inclusive (in
2006 and 2008). note: no entry for 2007, so it's not in the output
1 2009 1 value was 1 in the last 3 years inclusive (in
2008; 2006 is now too old to be considered for this variable)
1 2010 1 value was 1 in the last 3 years inclusive (in
2008)
1 2011 0 value was not 1 in the last 3 years inclusive
(in 2009, 2010 or 2011)
1 2012 0 value was not 1 in the last 3 years inclusive
(in 2010, 2011 or 2012)
2 2008 0 value was not 1 in the last 3 years inclusive
(in 2008)
2 2009 1 value was 1 in the last 3 years inclusive (in
2009)
*/