LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (February 2009, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 19 Feb 2009 12:48:12 -0500
Reply-To:   Ian Whitlock <iw1sas@GMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Ian Whitlock <iw1sas@GMAIL.COM>
Subject:   Re: File Join and Summarize Issue - Can Hash Be Used?
Comments:   cc: maryidahosas <maryidahosas@YAHOO.COM>
Content-Type:   text/plain; charset=ISO-8859-1

Summary: Hash example #iw-value=1

Mary,

You have made the example of how to calculate too hard for me to understand so I made up my own calculations to illustrate the principle. I have also included code to make the data sets to show you how easy it is to do when asking for help. The debugging code is left in the example.

data Big ; * in reality has over 4 million records and 3 fields ; input Case Rep SiteId info $ note $40. ; cards4 ; 2 1 1 a sum_site=1,2;day=1,2,3;hour=2 2 1 2 b sum_site=2,3;day=3;hour=1,2 2 2 1 a . 2 2 3 b . ;;;;

data Small ; * in reality has 90,000 records and 5 fields. ; input Site Day Hour HourlyAmt MaxMonthlyAmt ; cards ; 1 1 1 5 7 1 1 2 6 7 1 2 1 7 7 1 2 2 6 7 1 3 1 5 7 1 3 2 4 7 2 1 1 15 23 2 1 2 19 23 2 2 1 20 23 2 2 2 22 23 2 3 1 22 23 2 3 2 23 23 3 1 1 11 14 3 1 2 13 14 3 2 1 14 14 3 2 2 13 14 3 3 1 12 14 3 3 2 11 14 ;

data q ( keep = Case Rep SiteId note sumhours ) ; if _n_ = 1 then do ; declare hash h(dataset: "work.small") ; h.defineKey('site','day','hour') ; h.defineData('hourlyamt') ; h.defineDone() ; call missing(site, day, hour, hourlyamt) ; end ;

set big ; put "---------" note ; sumhours = 0 ; if info = "a" then do ; do site = 1, 2 ; do day = 1 to 3 ; do hour = 2 ; if h.find() = 0 then do ; sumhours + hourlyamt ; put site= day= hour= hourlyamt= sumhours= ; end ; else error "not found" ; end ; end ; end ; end ; else if info = "b" then do ; do site = 2, 3 ; do day = 3 ; do hour = 1 to 2 ; if h.find() = 0 then do ; sumhours + hourlyamt ; put site= day= hour= hourlyamt= sumhours ; end ; else error "not found" ; end ; end ; end ; end ; run ;

proc print data = q ; run ;

If you want help with real calculations then concentrate on showing what you want. I have no idea how you arrived at column MaxOfSummedHrlyAmts. I included my variables INFO and NOTE because you indicated somewhere that you had more information in big controlling the calculations.

I did not use LINKs for this simple example, but I would probably orgainze the step with LINKs in a more complex situation. Exactly how that should be done must be left to a time when it is clear how types of calculation there are and what they are and how you specify them.

Ian Whitlock ==============

Date: Wed, 18 Feb 2009 17:33:12 -0800 From: maryidahosas <maryidahosas@YAHOO.COM> Subject: File Join and Summarize Issue - Can Hash Be Used?

I have 2 files that I wish to join and summarize. The sql I used processes way too long. Can I use Hashing to do this, and if so, how, given that the distinct keys of the two files are different. I am going to show some pseudo data to help illustrate the issues.

File Big, in reality has over 4 million records and 3 fields. Here is a pseudo example with 4 records only: Case Rep SiteId 2 1 1 2 1 2 2 2 1 2 2 3

File Small, in reality has 90,000 records and 5 fields. It contains hourly data for 31 days for each SiteId. Here is a pseudo example with just 3 sites, 3 days and 2 hours of data per siteid as an example: SiteId Day Hour HourlyAmt MaxMonthlyAmt 1 1 1 5 7 1 1 2 6 7 1 2 1 7 7 1 2 2 6 7 1 3 1 5 7 1 3 2 4 7 2 1 1 15 23 2 1 2 19 23 2 2 1 20 23 2 2 2 22 23 2 3 1 22 23 2 3 2 23 23 3 1 1 11 14 3 1 2 13 14 3 2 1 14 14 3 2 2 13 14 3 3 1 12 14 3 3 2 11 14

I want to join the files from Big to Small on SiteId, which will get you in reality about 3.3Billion records. Here is what I would get with my pseudo example: case Rep SiteId Day Hour HrlyAmt MaxAmt 2 1 1 1 1 5 7 2 1 1 1 2 6 7 2 1 1 2 1 7 7 2 1 1 2 2 6 7 2 1 1 3 1 5 7 2 1 1 3 2 4 7 2 1 2 1 1 15 23 2 1 2 1 2 19 23 2 1 2 2 1 20 23 2 1 2 2 2 22 23 2 1 2 3 1 22 23 2 1 2 3 2 23 23 2 2 1 1 1 5 7 2 2 1 1 2 6 7 2 2 1 2 1 7 7 2 2 1 2 2 6 7 2 2 1 3 1 5 7 2 2 1 3 2 4 7 2 2 3 1 1 11 14 2 2 3 1 2 13 14 2 2 3 2 1 14 14 2 2 3 2 2 13 14 2 2 3 3 1 12 14 2 2 3 3 2 11 14

Then I want to summarize the data with by processing on Case and Rep which should get me to get about 30,000 final records. My pseudo example would get the following: Case Rep MaxOfSummedHrlyAmts SumOfMaxMonthlyAmt 2 1 28 30 2 2 21 21

Since the files are huge and I actually have 48 sets of these files that I wish to process the same way, I need it to execute very fast and every sql permutation executes too slow.

I think Hashing is the answer, but I'm a newbie and so I never get all the records. I believe that this is because although I would want to create something like a left join on SiteId, the 2 files actually have distinct "composite" keys based on 3 variables each. Big is based on Case, Rep, and SiteId. Small is based on SiteId, Date, and Hour. So when I built a hash, it only got 1 hour per SiteID from Small, instead of 744 hours per SiteId. This is probably because it ignored duplicate SiteId's in Small and only pulls the 1 record each.

Can I use Hash? How can I do this given the "composite" keys are different between the two? Any help and ideas would be greatly appreciated.

Thanks in advance. Mary


Back to: Top of message | Previous page | Main SAS-L page