LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2011, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 21 Jun 2011 11:32:04 -0400
Reply-To:     Bolotin Yevgeniy <YBolotin@SCHOOLS.NYC.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Bolotin Yevgeniy <YBolotin@SCHOOLS.NYC.GOV>
Subject:      Re: finding maximum of multiple variables over multiple date
              windows
In-Reply-To:  <DCCF5C4C413228459F79656548489F6D07CB5B2527@EX3VS1.nyced.org>
Content-Type: text/plain; charset="us-ascii"

Slight simplification: for purposes of this calculation, missing values can be treated as '0' (i.e. we only care about the '1' outcomes)

-----Original Message----- From: Bolotin Yevgeniy Sent: Tuesday, June 21, 2011 11:30 AM To: SAS-L@LISTSERV.UGA.EDU Subject: finding maximum of multiple variables over multiple date windows

my google-fu is insufficient to find anything, although I'm reasonably sure it's been done before...

I have a reasonably large (20 million-ish records) dataset with about a dozen variables I need to aggregate (sample data at the end of the email. each variable is trinary: character '0', '1', or missing)

For each of these variables, I need to find its maximum value in the last 2 years, last 3 years, ... etc up to 8, for a given set of years

Requirements: * should be easy to read/understand/modify (primary requirement) and, * if at all possible, should be reasonably fast (i have a variant which does this with a few macro loops and a bunch of PROC SQLs, but this takes about 3 hours to run - it should theoretically be doable much faster by doing them all in one pass, rather than one at a time like it's being done now...)

I tried writing a single-datastep version with macro logic, but the result consisted almost entirely of macro variables and practically no code, and was completely unreadable.

Sample input data:

data HAVE; *note: dates may not be continuous - e.g. id 1 is missing year 2007; input id year a :$1. b :$1. c :$1. d :$1. e :$1. f :$1.; datalines; 1 2006 1 0 0 1 1 1 1 2008 1 0 0 0 1 1 1 2009 0 0 . 0 1 1 1 2010 0 0 0 0 1 1 1 2011 0 0 0 0 1 0 1 2012 0 0 0 0 1 0 2 2008 0 1 1 0 1 1 2 2009 1 1 1 0 0 0 ; Run;

/* Output should have something like the following variables: id year max_of_last_2_years_a max_of_last_3_years_a max_of_last_4_years_a ... max_of_last_8_years_a max_of_last_2_years_b ... max_of_last_8_years_f;

with the following values for e.g. max_of_last_3_years_a: id year max_of_last_3_years_a comments 1 2006 1 value was 1 in the last 3 years inclusive (in this case, just in 2006) 1 2008 1 value was 1 in the last 3 years inclusive (in 2006 and 2008). note: no entry for 2007, so it's not in the output 1 2009 1 value was 1 in the last 3 years inclusive (in 2008; 2006 is now too old to be considered for this variable) 1 2010 1 value was 1 in the last 3 years inclusive (in 2008) 1 2011 0 value was not 1 in the last 3 years inclusive (in 2009, 2010 or 2011) 1 2012 0 value was not 1 in the last 3 years inclusive (in 2010, 2011 or 2012) 2 2008 0 value was not 1 in the last 3 years inclusive (in 2008) 2 2009 1 value was 1 in the last 3 years inclusive (in 2009) */


Back to: Top of message | Previous page | Main SAS-L page