Date: Fri, 16 Jan 2004 08:42:21 -0500
Reply-To: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Subject: Re: Computing Probabilities for Day of Week, Hour of Day, and Day*Hour
Content-Type: text/plain; charset=US-ASCII
Dave's response to Harry and me was also addressed to SAS-L, but for
some reason never made it into distribution. See my comments inline.
>>> "Dave Sorensen" <Dave.Sorensen@jur.ku.dk> 01/06/04 09:57AM >>>
>Dear Howard and Harry,
>I found a great XMas gift in my email box upon returning from the
>holidays. Howard's post of XMas Eve, Dec 24.
>I never really got to go through your code before XMas, Harry, and
>finding Howard's "tested" code when I returned from the holidays, I
>tried it and have found that it runs beautifully. So forgive me if I
>don't work my way through your code, Harry. This stuff is tough for me.
>Despite your enormously informative annotations, Howard - me being a
>novice - it took me a while to figure out exactly what was happening in
>your code. Unlike my old hardcoded program - which created variables
>containing probabilities for each hour of the week (what a
>chore!) - the mechanics operating inside of PROC SUMMARY are less
>transparent, because there is not the same opportunity to run a PROC
>PRINT on all the millions of variables created and see whether they
>correspond to the From/To times/dates in various cases.
True. Whenever you use procs and other "black boxes" providing useful
functionality, this tends to happen. But the alternative can involve a
lot of wheel reinvention.
>I got around
>this by selecting test cases with pertinent characteristics (e.g.,
>overlapping midnight; overlapping months; etc.), then creating
>single-case datasets with just one test case each and running them
>through The Machine. I did this about ten times, using ten different
>test cases. This process convinced me that your code,
>Howard, is working absolutely perfectly.
Here's a more productive way to run such a series of tests. Stack all
of the test data in one file, with an additional variable (call it
SCENARIO) to differentiate cases. Then make SCENARIO the first BY
variable in each step. Sometimes it's a bit more complicated; eg, a need
to add some reinitializations on SCENARIO boundaries in DATA steps.
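A minimal sketch of that setup (the dataset, scenario labels, and the reinitialized variable are all invented for illustration):

```sas
* Stack all test cases in one dataset, labeled by SCENARIO;
data testcases;
   input scenario $ FromDate :date9. FromTime ToDate :date9. ToTime;
   format FromDate ToDate date9.;
   datalines;
midnight 23jul2002 0.95 24jul2002 0.05
months   31jul2002 0.50 01aug2002 0.50
week     20jul2002 0.50 27jul2002 0.50
;

proc sort data=testcases;
   by scenario;
run;

* Subsequent steps run per scenario - reinitialize on boundaries;
data checked;
   set testcases;
   by scenario;
   if first.scenario then counter = 0;  * example reinitialization;
   counter + 1;
run;
```

One run then exercises every test case, and the SCENARIO value identifies each result.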
>I was also happily surprised to discover that your code is far more
>sophisticated than my original 120 page hardcoded program, since (a) it
>accounts for an indefinite range between the From/To dates/times, and
>(b) it creates probabilities based on the number of hours lying within
>any time frame of interest. E.g.,
Often there is a choice between doing something by formula and doing it
by enumeration. Sometimes there are tradeoffs. In this case I think it's
one-sided. That's why I never suggested that Dave post his code; when he
indicated its bulk I was 99% sure that a different approach was needed,
rather than some tweaks.
Your computer should be working for you. When it really, really feels
the other way around, it's probably time to step back and tackle the
problem differently. Perhaps Jim Groenevald has stated this better in
one of his sigs.
>FromTime: Wed 22:00 ToTime: Thursday 04:00 becomes Prob Wed=0.33,
>Thur=0.66. My old program had merely assigned them prob= 0.5 each.
>Sorry to ramble, but I am just tickled pink.
>THANK YOU BOTH, HOWARD AND HARRY, VERY VERY MUCH!
> -----Original message-----
>Date: Wed, 24 Dec 2003 13:03:27 -0500
>Reply-To: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
>Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
>From: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
>Subject: Re: Computing Probabilities for Day of Week, Hour of Day,
> and Day*Hour
>Here's some tested code.
>First generate some test data:
> data demo;
>    drop date n;
>    do date = '01jan2002'd to '31dec2004'd;
>       do n = 1 to 200*ranuni(1);
>          FromDate = date - round(30*abs(rannor(1))/50);
>          ToDate   = date + round(30*abs(rannor(1))/50);
>          FromTime = ranuni(1);
>          ToTime   = ranuni(1);
>          if FromDate<ToDate or FromTime<ToTime then output;
>       end;
>    end;
>    format FromDate ToDate date9.;
> run;
>The dataset size can be altered by changing the date range in the outer
>DO or the factor 200 in the inner DO, but as coded it generates about
>91,000 observations, close to Dave's dataset size.
>Now here's the code which sets up the necessary calculations. Since the
>output is large (one observation per hour for each original observation)
>and since it is needed only once, make it a view.
> data forsummary / view=forsummary;
>    set demo;
>    fromDT = intnx('hour',dhms(FromDate,0,0,24*60*60*FromTime),0);
>    toDT   = intnx('hour',dhms( ToDate ,0,0,24*60*60* ToTime ),0);
>    fraction = 1 / (1 + intck('hour',fromDT,toDT) );
>    do interpolatedTime = fromDT to toDT by 60*60;
>       interpolatedDate = datepart(interpolatedTime);
>       year    = year(interpolatedDate);
>       month   = month(interpolatedDate);
>       day     = day(interpolatedDate);
>       weekday = weekday(interpolatedDate);
>       hour    = hour(interpolatedTime);
>       output;
>    end;
>    keep year month day weekday hour fraction;
> run;
>The first two assignment statements create SAS datetime variables for the
>endpoints of the not-at-home intervals, rounded to whole hours. The factor
>24*60*60 applied to the fourth DHMS argument converts the fractional part
>of a day to a seconds-since-midnight convention. Then the DHMS function
>combines date and time into a SAS datetime. Then the INTNX function adjusts
>these to align with the beginning of a clock hour interval (e.g., any time
>during the 12 o'clock hour is transformed to 12:00:00).
>In the expression for FRACTION, the INTCK function counts the number of
>hours in the interval. The result will generally be low by 1 because of the
>way times were aligned to hour boundaries. So the value of FRACTION
>indicates the proportion of each incident which should be assigned to a
>single hour interval.
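As a worked example of that arithmetic (the specific dates are invented): an interval from Wednesday 22:00 to Thursday 04:00 spans the seven clock hours 22:00 through 04:00, while INTCK counts only the six hour boundaries crossed; hence the 1 + in the denominator.

```sas
* Check of the FRACTION arithmetic for one invented interval;
data _null_;
   fromDT = dhms('23jul2003'd, 22, 0, 0);   /* Wednesday 22:00 */
   toDT   = dhms('24jul2003'd,  4, 0, 0);   /* Thursday  04:00 */
   hours    = intck('hour', fromDT, toDT);  /* 6 boundary crossings */
   fraction = 1 / (1 + hours);              /* 1/7, one share per hour slot */
   put hours= fraction=;
run;
```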
>The DO loop actually does the interpolation which allocates a fraction of
>each incident to the hours of the not-at-home interval. It steps through the
>interval an hour (60*60 seconds) at a time and generates an observation for
>each hour. The assignment statements use appropriate date and time
>functions to derive the clock and calendar granularities which are of
>interest. Notice that there is no branching (IF, SELECT, etc.) in this DATA
>step; just formulas and a loop.
>Now it gets easier. PROC SUMMARY does the work.
> proc summary data=forsummary chartype;
>    class year month day weekday hour;
>    types year month day weekday hour
>          month*hour weekday*hour month*weekday;
>    output out=aggregatedInterpolations sum(fraction)=burglaries;
> run;
>The TYPES statement calls for a simple aggregation on each of the individual
>granularities, plus crossings which might be of interest, such as
>month*hour to look for seasonal patterns in burglars' working hours.
>The following PROC MEANS step is an adding-up check, and also generates a
>breakdown by _TYPE_.
> proc means data=aggregatedInterpolations sum maxdec=0;
>    class _type_;
>    var burglaries;
> run;
>          Analysis Variable : burglaries
>
>    _TYPE_    N Obs          Sum
>    ----------------------------
>    00001        24        91086
>    00010         7        91086
>    00011       168        91086
>    00100        31        91086
>    01000        12        91086
>    01001       288        91086
>    01010        84        91086
>    10000         5        91086
>For each _TYPE_ level, the values add to 91086, which (reassuringly) is
>precisely the number of observations in the test dataset. The 12
>observations for _TYPE_=01000 represent the months of the year, the 7
>observations for _TYPE_=00010 represent the days of the week, and the 84
>observations for _TYPE_=01010 represent the crossings of these two
>granularities.
>Here is code to present month by day of week subtotals.
> proc tabulate data=aggregatedInterpolations;
>    where _type_='01010';
>    class month weekday;
>    var burglaries;
>    table burglaries, month all, (weekday all)*sum*f=6.;
> run;
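The same WHERE technique pulls out any other granularity. For example, a sketch of the day-of-week totals (the _TYPE_ value follows the CHARTYPE pattern over YEAR MONTH DAY WEEKDAY HOUR, so '00010' flags the WEEKDAY-only aggregation):

```sas
* Day-of-week totals: _TYPE_='00010' selects the WEEKDAY-only aggregation;
proc print data=aggregatedInterpolations noobs;
   where _type_='00010';
   var weekday burglaries;
run;
```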
>On Fri, 19 Dec 2003 09:52:43 -0500, Howard Schreier
><Howard_Schreier@ITA.DOC.GOV> wrote:
>>I think Harry's on the right track. Here are some additional suggestions.
>>Use the DHMS function to create datetime values for your start and end
>>points. The H and M arguments will be zero and your fractional times will
>>be multiplied by 24x60x60 to derive the S argument.
>>Loop from the start to end datetimes in hourly increments. Include an
>>OUTPUT statement to generate hourly observations with datetimes and
>>fractions. INTCK and INTNX functions may be handy. Create additional
>>variables using functions like WEEKDAY and YEAR to set up all of the
>>granularities of interest (day of week, ISO week, calendar quarter, etc.).
>>Then use PROC SUMMARY to do all aggregations of interest. The TYPES
>>statement will allow generating only those; for example, MONTH*HOUR to see
>>if there is a seasonal pattern in the time of day.
>>Use WHERE filtering to pull out one aggregation at a time.
>>On Thu, 18 Dec 2003 23:59:56 +0100, Dave Sorensen
>><Dave.Sorensen@jur.ku.dk> wrote:
>>>I've spent weeks - and written hundreds of pages of program code - on the
>>>following problem. Still not licked. The problem is probably due to the
>>>fact that I am an amateur and don't know how to routinize repetitive
>>>tasks. Here's the deal.
>>>I have a big dataset - 87,000 cases - each of which concerns a report
>>>from victims on a burglary suffered in 2002.
>>>Since burglars tend to strike when residents are out of the house, the
>>>dataset contains "From Times" and "To Times," which represent the date and
>>>time people told police that they left their home, and the date and time
>>>they returned home (and found the place burgled), respectively. The actual
>>>date and time of the burglary thus lies somewhere between these two time
>>>points. 77% of the FROM/TO dates/times are within 24 hours of each other,
>>>82% within 48 hours, 96% within a week.
>>>I have computed probabilities concerning the (a) Day of Week, (b) Hour of
>>>Day, and (c) Day*Hour (which combines a&b) that these crimes might have
>>>occurred. Day of Week was the easiest. Consider the following examples:
>>>DAY OF WEEK
>>>
>>>              (FROM DATE)     (TO DATE)
>>>BURGLARY 1    Wed July 23     Wed July 23
>>>BURGLARY 2    Wed July 23     Thurs July 24
>>>BURGLARY 3    Wed July 23     Mon July 30
>>>
>>>The probabilities by day are as follows:
>>>
>>>              PSun   PMon   PTues   PWed   PThu   PFri   PSat
>>>BURGLARY 1      0      0      0       1      0      0      0
>>>BURGLARY 2      0      0      0     1/2    1/2      0      0
>>>BURGLARY 3    1/6    1/6      0     1/6    1/6    1/6    1/6
>>>So far, I have restricted my assignments to cases where the FROM and TO
>>>dates are within 7 days of each other. Going further than this makes a
>>>lot of programming code, and only results in the addition of a few cases.
>>>HOUR OF DAY
>>>The more terrible and problematic part has been coding probabilities for
>>>Hour of Day. So far, I have restricted this program to cases where the
>>>FROM and TO times are within 24 hours of each other - because just this
>>>took around 100 pages of repetitive program code to assign probabilities.
>>>The general idea is like the above. Consider the following:
>>> (FROM TIME) (TO TIME)
>>>BURGLARY 1 Mon 1pm Mon 1pm
>>>BURGLARY 2 Mon 1pm Mon 2.59pm
>>>BURGLARY 3 Mon 1pm Tues 7:59am
>>>The probabilities by hour are as follows (in military time):
>>>
>>>              p0    p1    p2    p3    p4    p5    p6    p7    p8    p9   p10   p11
>>>BURGLARY 1     0     0     0     0     0     0     0     0     0     0     0     0
>>>BURGLARY 2     0     0     0     0     0     0     0     0     0     0     0     0
>>>BURGLARY 3  1/19  1/19  1/19  1/19  1/19  1/19  1/19  1/19     0     0     0     0
>>>
>>>             p12   p13   p14   p15   p16   p17   p18   p19   p20   p21   p22   p23
>>>BURGLARY 1     0     1     0     0     0     0     0     0     0     0     0     0
>>>BURGLARY 2     0   1/2   1/2     0     0     0     0     0     0     0     0     0
>>>BURGLARY 3     0  1/19  1/19  1/19  1/19  1/19  1/19  1/19  1/19  1/19  1/19  1/19
>>>DAY OF WEEK * HOUR OF DAY
>>>Programming the above was hell. I did it with a series of IF-THEN
>>>statements. LOTS of pages. To make matters worse, I needed to
>>>differentiate between probabilities for hours that lay on the same date and
>>>those that lay on two consecutive dates. This is because I also want to
>>>compute Day*Hour so I can make a nifty graph showing how burglaries rise
>>>and fall (probabilistically) - by hour - over the course of an entire week.
>>>Okay. It was all very terrible, but I succeeded at coding Day*Hour, tested
>>>it up, down and sideways, and it worked. Over 100 pages of code.
>>>Problem is that when I compute my probabilities for Day*Hour (which
>>>combines probabilities from Day of Week with Hour of Day), the
>>>probabilities computed are based on the lowest common denominator of the
>>>two, which are the cases used to calculate the Hour of Day probabilities.
>>>As stated earlier, these probabilities were based on cases where the FROM
>>>and TO times were within 24 hours of each other (thus, at most on two
>>>consecutive days). But when I compare graphs of my final results for Day
>>>of Week probabilities with Day*Hour probabilities, I can see that the
>>>former shows far more cases on the weekends than the latter. And this, I
>>>have realized, is because the latter uses only a 24-hour time window and
>>>thus drops all cases where people were burgled while away for the weekend.
>>>In order to fix this bias, I need to compute Hour of Day probabilities
>>>using a longer time window.
>>>I started to extend my Hour of Day probabilities to encompass a 48-hour
>>>period, but (a) this is a hellish coding task the way I'm doing it, and
>>>(b) even this will not fix my bias entirely - since 18% of the FROM and TO
>>>dates/times are more than 48 hours apart.
>>>So here's my question: Is there any way to do this efficiently and to
>>>encompass a longer period of hours? (168 hours - a week - would be great.
>>>Or even better - the whole shebang.)
>>>I have FROM and TO Times in decimal format, i.e., 1pm= 0.076923076. (Var
>>>names= FromTime and ToTime)
>>>I have FROM and TO Dates in SAS Date format. (Var names= FromDate and
>>>ToDate)
>>>Thanks for reading this far.
>>>And if you actually answer, thanks a million in advance.
>>>Univ. of Copenhagen