LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2004, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 16 Jan 2004 08:42:21 -0500
Reply-To:     Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Subject:      Re: Computing Probabilities for Day of Week, Hour of Day,
              and Day*Hour
Comments: cc: Dave.Sorensen@jur.ku.dk
Content-Type: text/plain; charset=US-ASCII

Dave's response to Harry and me was also addressed to SAS-L, but for some reason never made it into distribution. See my comments inline.

>>> "Dave Sorensen" <Dave.Sorensen@jur.ku.dk> 01/06/04 09:57AM >>> >Dear Howard and Harry, > > >I found a great XMas gift in my email box upon returning from the >holidays. Howard's post of XMas Eve, Dec 24. > >I never really got to go though your code before XMas, Harry, and upon >finding Howard's "tested" code when I returned from the holidays, I >tried it and have found that it runs beautifully. So forgive me if I >don't work my way though your code, Harry. This stuff is tough for me to >decipher. > > >Despite your enormously informative annotations, Howard, - me being a >novice, it took me a while to figure out exactly what was happening in >your code. Unlike my old hardcoded program - which created variables >containing probabilities for each hour of the week (what a programming >chore!), the mechanics operating inside of PROC SUMMARY are less >transparent because there is not the same opportunity to run a proc >print on all the millions of variables created and see whether they >correspond to the From/To times/dates in various cases .

True. Whenever you use procs and other "black boxes" providing useful functionality, this tends to happen. But the alternative can involve a lot of wheel reinvention.

>I got around >this by selecting test cases with pertinent characteristics (e.g., those >overlapping midnight; those overlapping months, etc.), and then creating >single case datasets with just one test case each and running them >through The Machine. I did this about ten times, using ten different >test cases, respectively. This process convinced me that your code, >Howard, is working absolutely perfectly.

Here's a more productive way to run such a series of tests. Stack all of the test data in one file, with an additional variable (call it SCENARIO) to differentiate cases. Then make SCENARIO the first BY variable in each step. Sometimes it's a bit more complicated; eg, a need to add some reinitializations on SCENARIO boundaries in DATA steps.

> > > >I was also happily surprised to discover that your code is far more >sophisticated than my original 120 page hardcoded program, since (a) it >accounts for an indefinite range between the From/To dates/times, and >(b) it creates probabilities based on the number of hours lying within >any time frame of interest. E.g.,

Often there is a choice between doing something by formula and doing it by enumeration. Sometimes there are tradeoffs. In this case I think it's one-sided. That's why I never suggested that Dave post his code; when he indicated its bulk I was 99% sure that a different approach was needed, rather than some tweaks.

Your computer should be working for you. When it really, really feels the other way around, it's probably time to step back and tackle the problem differently. Perhaps Jim Groenevald has stated this better in one of his sigs

> >FromTime: Wed 22:00 ToTime: Thursday 04:00 becomes Prob Wed=0.33, Prob >Thur=0.66. My old program had merely assigned them prob= 0.5 each. > > >Sorry to ramble, but I am just tickled pink. > >THANK YOU BOTH, HOWARD AND HARRY, VERY VERY MUCH! >Dave S. > > -----Oprindelig meddelelse----- >Date: Wed, 24 Dec 2003 13:03:27 -0500 >Reply-To: Howard Schreier <Howard_Schreier@ITA.DOC.GOV> >Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> >From: Howard Schreier <Howard_Schreier@ITA.DOC.GOV> </cgi-bin/wa?A2=ind0312D&L=sas-l&D=0&P=19666> >Subject: Re: Computing Probabilities for Day of Week, Hour of Day, > and Day*Hour > >Here's some tested code. > > >First generate some test data: > > > data demo; > drop date n; > do date = '01jan2002'd to '31dec2004'd; > do n = 1 to 200*ranuni(1); > FromDate = date - round(30*abs(rannor(1))/50); > ToDate = date + round(30*abs(rannor(1))/50); > FromTime = ranuni(1); > ToTime = ranuni(1); > if FromDate<ToDate or FromTime<ToTime then output; > end; > end; > format FromDate ToDate date9.; > run; > > >The dataset size can be altered by changing the date range in the outer DO >or the factor 200 in the inner DO, but as coded it generates about 90,000 >observations, close to Dave's dataset size. > > >Now here's the code which sets up the necessary calculations. Since the >output is large (one observation per hour for each original observation) >and since it is needed only once, make it a view. > > > data forsummary / view=forsummary; > set demo; > fromDT = intnx('hour',dhms(FromDate,0,0,24*60*60*FromTime),0); > toDT = intnx('hour',dhms( ToDate,0,0,24*60*60* ToTime),0); > fraction = 1 / (1 + intck('hour',fromDT,toDT) ); > do interpolatedTime = fromDT to toDT by 60*60; > interpolatedDate = datepart(interpolatedTime); > year = year(interpolatedDate); > month = month(interpolatedDate); > day = day(interpolatedDate); > weekday = weekday(interpolatedDate); > hour = hour(interpolatedTime); > output; > end; > keep year month day weekday hour fraction; > run;

> > >The first two assignment statements create SAS datetime variables for the >endpoints of the not-at-home intervals, rounded to whole hours. The factor >24*60*60 applied to the fourth DHMS argument converts the fractional times >of day to a seconds-since-midnight convention. Then the DHMS combines the >date and time into a SAS datetime. Then the INTNX function adjusts each of >these to align with the beginning of a clock hour interval (eg, 12:34:56 is >transformed to 12:00:00). > > >In the expression for FRACTION, the INTCK function counts the number of >hours in the interval. The result will generally be low by 1 because of the >way times were aligned to hour boundaries. So the value for FRACTION will >indicate the proportion of each incident which should be assigned to a >single hour interval. > > >The DO loop actually does the interpolation which allocates a burglary >incident to the hours of the not-at-home interval. It steps through the >interval an hour (60*60 seconds) at a time and generates an observation for >each hour. The assignment statements use appropriate date and time >functions to derive clock and calendar granularities which are of interest. > > >Notice that there is no branching (IF, SELECT, etc.) in this DATA step, >just formulas and a loop. > > >Now it gets easier. PROC SUMMARY does the work. > > > proc summary data=forsummary chartype; > class year month day weekday hour; > types year month day weekday hour > month*hour weekday*hour month*weekday; > output out=aggregatedInterpolations sum(fraction)=burglaries; > run; > > >The TYPES statement calls for a simple aggregation on each of the >granularities, plus crossings which might be of interest, such as >month*hour to look for seasonal patterns in burglars' working hours. > > >The pard Schreier <Howard_Schfollowing PROC MEANS step is an adding-up check, and also generates a >useful summary. > > > proc means data=aggregatedInterpolations sum maxdec=0; > class _type_; > var burglaries; > run; > > >Result: > > > Analysis Variable : burglaries > > > N > _TYPE_ Obs Sum > > > 00001 24 91086 > 00010 7 91086 > 00011 168 91086 > 00100 31 91086 > 01000 12 91086 > 01001 288 91086 > 01010 84 91086 > 10000 5 91086 > > >For each _TYPE_ level, the values add to 91086, which (reassuringly) is >precisely the number of observations in the test dataset. The 12 >observations for _TYPE_=01000 represent the months of the year, the 7 >observations for _TYPE_=00010 represent the days of the week, and the 84 >observations for _TYPE_=01010 represent the crossings of these two CLASS >variables. > > >Here is code to present month by day of week subtotals. > > > proc tabulate data=aggregatedInterpolations; > where _type_='01010'; > class month weekday; > var burglaries; > table burglaries, month all, (weekday all)*sum*f=6.; > run; > > >On Fri, 19 Dec 2003 09:52:43 -0500, Howard Schreier ><Howard_Schreier@ITA.DOC.GOV> wrote: > >u fraction = 1 / (1 + intck('hou >>I think Harry's on the right track. Here are some additional thoughts. >> >>Use the DHMS function to create datetime values for your start and end >>points. The H and M arguments will be zero and your fractional times should >>be multiplied by 24x60x60 to derive the S argument. >> >>Loop from the start to end datetimes in hourly increments. Include an >>OUTPUT statement to generate hourly observations with datetimes and the >>fractions. INTCK and INTNX functions may be handy. Create additional >>variables using functions like WEEKDAY and YEAR to set up all of the >>granularities of interest (day of week, ISO week, calendar quarter, etc.). >> >>Then use PROC SUMMARY to do all aggregations of interest. The TYPES >>statement will allow generating only those; for example MONTH*HOUR to see >>if there is a seasonal pattern in the time of day. >> >>Use WHERE filtering to pull out one aggregation at a time. >> >>On Thu, 18 Dec 2003 23:59:56 +0100, Dave Sorensen <Dave.Sorensen@JUR.KU.DK> >>wrote: >> >>>Dear SAS-L >>> >>>I've spent weeks - and written hundreds of pages of program code - on the >>following problem. Still not licked. The problem is probably due to the >>fact that I am an amateur and don't know how to routinize repetitive code. >>Here's the deal. >>> >>>I have a big dataset - 87,000 cases - each of which concern information >>from victims on a burglary suffered in 2002. >>>Since burglars tend to strike when residents are out of the house, the >>dataset contains "From Times" and "To Times," which represent the date and >>time people told police that they left their home, and the date and time >>they returned home (and found the place burgled), respectively. The actual >>date and time of the burglary thus lies somewhere between these two >>points. 77% or the FROM/TO dates/times are within 24 hours of eachother. >>82% within 48 hours. 96% within a week. >>> >>>I have computed probabilities concerning the (a) Day of Week, (b) Hour of >>Day, and (c) Day*Hour (which combines a&b) that these crimes might have >>occurred. Day of Week was the easiest. Consider the following cases: >>> >>>DAY OF WEEK pard Schreier <Howard_Sch>>> (FROM DATE) (TO DATE) >>>BURGLARY 1 Wed July 23 Wed July 23 >>>BURGLARY 2 Wed July 23 Thurs July 24 >>>BURGLARY 3 Wed July 23 Mon July 30 >>> >>>The probabilities by day are as follows: >>> >>> PSun PMon PTues PWed PThu PFri >>PSat >>>BURGLARY 1 0 0 0 1 0 0 0 >>>BURGLARY 2 0 0 0 1/2 1/2 0 0 >>>BURGLARY 2 1/6 1/6 0 1/6 1/6 1/6 >1/6 >>> >>>So far, I have restricted my assignments to cases where the FROM and TO >>dates are within 7 days of each other. Going further than this makes for a >>lot of programming code, and only results in the addition of a few more >>cases. >>> >>>HOUR OF DAY >>>The more terrible and problematic part has been coding probabilities for >>Hour of Day. So far, I have restricted this program to cases where the >>FROM and TO times are within 24 hours of each other - because just this >>took around 100 pages of repetative program code to assign probabilities. >>The general idea is like the above. Consider the following: >>> >>> (FROM TIME) (TO TIME) >>>BURGLARY 1 Mon 1pm Mon 1pm >>>BURGLARY 2 Mon 1pm Mon 2.59pm >>>BURGLARY 3 Mon 1pm Tues 7:59am >>> >>>The probabilities by day are as follows (in military time): >>> >>> p0 p1 p2 p3 p4 p5 p6 p7 >>p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 >>p19 p20 p21 p22 p23 >>>BURGLARY 1 0 0 0 0 0 0 0 >>0 0 0 0 0 0 1 0 0 >>0 0 0 0 0 0 0 0 >>>BURGLARY 2 0 0 0 0 0 0 0 >>0 0 0 0 0 0 1/2 1/2 0 >>0 0 0 0 0 0 0 0 >>>BURGLARY 3 1/19 1/19 1/19 1/19 1/19 1/19 1/19 1/19 0 >>0 0 0 0 1/19 1/19 1/19 1/19 1/19 1/19 >>1/19 1/19 1/19 1/19 1/19 >>> >>> >>>DAY OF WEEK * HOUR OF DAY >>>Programming the above was hell. I did it with a series of IF-THEN DO >>statements. LOTS of pages. To make matters worse, I needed to >>differentiate between probabilities for hours that lay on the same day and >>those that lay on two consecutive dates. This is because I also want to >>compute Day*Hour so I can make a nifty graph showing how burglaries rise >>and fall (probabilistically) - by hour - over the course of an average >week. >>> >>>Okay. It was all very terrible, but succeeded at coding Day*Hour, QC'd it >>up, down and sideways, and it worked. Over 100 pages of code. >>> >>>Problem is that when I compute my probabilities for Day*Hour (which >>combines probabilities from Day of Week with Hour of Day), the >>probabilities computed are based on the lowest common denominator of >cases - >> which are the cases used to calculate the Hour of Day probabilities. As >>stated earlier, these probabilities were based on cases where the FROM and >>TO times were within 24 hours of each other (thus, at most on two >>consecutive days). But when I compare graphs of my final results for Day >>of Week probabilities with Day*Hour probabilities, I can see that the >>former shows far more cases on the weekends than the latter. And that, I >>have realized, is because the latter uses only a 24-hour time window - and >>thus drops all cases where people were burgled whilpard Schreier <Howard_Sche away for the weekend. >>In order to fix this bias, I need to compute Hour of Day probabilities >>using a longer time window. >>> >>>I started to extend my Hour of Day probabilities to encompass a 48-hour >>period, but (a) this is a hellish coding task the way I'm doing it, and (b) >>even this will not fix my bias entirely - since 18% of the FROM and TO >>dates/times are more than 48 hours apart. >>> >>>So here's my question: Is there any way to do this efficiently and >>encompass a longer period of hours? (168 hours - a week - would be great. >>Or even better - the whole shebang). >>>I have FROM and TO Times in decimal format, i.e., 1pm= 0.076923076. (Var >>names= FromTime and ToTime) >>>I have FROM and TO Dates in SAS Date format. (Var names= FromDate and >>ToDate) >>> >>>Thanks for reading this far. >>>And if you actually answer, thanks a million in advance. >>> >>>Dave Sorensen >>>US Ex-Pat >>>Univ. of Copenhagen


Back to: Top of message | Previous page | Main SAS-L page