|
Summary: Hash example
#iw-value=1
Mary,
You have made the example of how to calculate too hard for me to
understand so I made up my own calculations to illustrate the
principle. I have also included code to make the data sets to show
you how easy it is to do when asking for help. The debugging code
is left in the example.
data Big ; * in reality has over 4 million records and 3 fields ;
input Case Rep SiteId info $ note $40. ;
cards4 ;
2 1 1 a sum_site=1,2;day=1,2,3;hour=2
2 1 2 b sum_site=2,3;day=3;hour=1,2
2 2 1 a .
2 2 3 b .
;;;;
data Small ; * in reality has 90,000 records and 5 fields. ;
input Site Day Hour HourlyAmt MaxMonthlyAmt ;
cards ;
1 1 1 5 7
1 1 2 6 7
1 2 1 7 7
1 2 2 6 7
1 3 1 5 7
1 3 2 4 7
2 1 1 15 23
2 1 2 19 23
2 2 1 20 23
2 2 2 22 23
2 3 1 22 23
2 3 2 23 23
3 1 1 11 14
3 1 2 13 14
3 2 1 14 14
3 2 2 13 14
3 3 1 12 14
3 3 2 11 14
;
data q ( keep = Case Rep SiteId note sumhours ) ;
if _n_ = 1 then
do ;
declare hash h(dataset: "work.small") ;
h.defineKey('site','day','hour') ;
h.defineData('hourlyamt') ;
h.defineDone() ;
call missing(site, day, hour, hourlyamt) ;
end ;
set big ;
put "---------" note ;
sumhours = 0 ;
if info = "a" then do ;
do site = 1, 2 ;
do day = 1 to 3 ;
do hour = 2 ;
if h.find() = 0 then
do ;
sumhours + hourlyamt ;
put site= day= hour= hourlyamt= sumhours= ;
end ;
else error "not found" ;
end ;
end ;
end ;
end ;
else
if info = "b" then do ;
do site = 2, 3 ;
do day = 3 ;
do hour = 1 to 2 ;
if h.find() = 0 then do ;
sumhours + hourlyamt ;
put site= day= hour= hourlyamt= sumhours ;
end ;
else error "not found" ;
end ;
end ;
end ;
end ;
run ;
proc print data = q ;
run ;
If you want help with real calculations then concentrate on showing
what you want.
I have no idea how you arrived at column MaxOfSummedHrlyAmts. I included my
variables INFO and NOTE because you indicated somewhere that you had more
information in big controlling the calculations.
I did not use LINKs for this simple example, but I would probably orgainze the
step with LINKs in a more complex situation. Exactly how that should be done
must be left to a time when it is clear how types of calculation there
are and what
they are and how you specify them.
Ian Whitlock
==============
Date: Wed, 18 Feb 2009 17:33:12 -0800
From: maryidahosas <maryidahosas@YAHOO.COM>
Subject: File Join and Summarize Issue - Can Hash Be Used?
I have 2 files that I wish to join and summarize. The sql I used
processes way too long. Can I use Hashing to do this, and if so,
how,
given that the distinct keys of the two files are different. I am
going to show some pseudo data to help illustrate the issues.
File Big, in reality has over 4 million records and 3 fields. Here
is a pseudo example with 4 records only:
Case Rep SiteId
2 1 1
2 1 2
2 2 1
2 2 3
File Small, in reality has 90,000 records and 5 fields. It contains
hourly data for 31 days for each SiteId. Here is a pseudo example
with just 3 sites, 3 days and 2 hours of data per siteid as an
example:
SiteId Day Hour HourlyAmt MaxMonthlyAmt
1 1 1 5 7
1 1 2 6 7
1 2 1 7 7
1 2 2 6 7
1 3 1 5 7
1 3 2 4 7
2 1 1 15 23
2 1 2 19 23
2 2 1 20 23
2 2 2 22 23
2 3 1 22 23
2 3 2 23 23
3 1 1 11 14
3 1 2 13 14
3 2 1 14 14
3 2 2 13 14
3 3 1 12 14
3 3 2 11 14
I want to join the files from Big to Small on SiteId, which will get
you in reality about 3.3Billion records. Here is what I would get
with my pseudo example:
case Rep SiteId Day Hour HrlyAmt MaxAmt
2 1 1 1 1 5 7
2 1 1 1 2 6 7
2 1 1 2 1 7 7
2 1 1 2 2 6 7
2 1 1 3 1 5 7
2 1 1 3 2 4 7
2 1 2 1 1 15 23
2 1 2 1 2 19 23
2 1 2 2 1 20 23
2 1 2 2 2 22 23
2 1 2 3 1 22 23
2 1 2 3 2 23 23
2 2 1 1 1 5 7
2 2 1 1 2 6 7
2 2 1 2 1 7 7
2 2 1 2 2 6 7
2 2 1 3 1 5 7
2 2 1 3 2 4 7
2 2 3 1 1 11 14
2 2 3 1 2 13 14
2 2 3 2 1 14 14
2 2 3 2 2 13 14
2 2 3 3 1 12 14
2 2 3 3 2 11 14
Then I want to summarize the data with by processing on Case and Rep
which should get me to get about 30,000 final records. My pseudo
example would get the following:
Case Rep MaxOfSummedHrlyAmts SumOfMaxMonthlyAmt
2 1 28 30
2 2 21 21
Since the files are huge and I actually have 48 sets of these files
that I wish to process the same way, I need it to execute very fast
and every sql permutation executes too slow.
I think Hashing is the answer, but I'm a newbie and so I never get
all the records. I believe that this is because although I would
want to create something like a left join on SiteId, the 2 files
actually have distinct "composite" keys based on 3 variables each.
Big is based on Case, Rep, and SiteId. Small is based on SiteId,
Date, and Hour. So when I built a hash, it only got 1 hour per
SiteID from Small, instead of 744 hours per SiteId. This is
probably because it ignored duplicate SiteId's in Small and only
pulls the 1 record each.
Can I use Hash? How can I do this given the "composite" keys are
different between the two? Any help and ideas would be greatly
appreciated.
Thanks in advance.
Mary
|