Date: Sun, 4 Aug 1996 15:41:51 EDT
Reply-To: whitloi1@WESTATPO.WESTAT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM>
Subject: Re: sql help
Subject: sql help
Summary: Find the minimum relevant year not spent in an educational
institution.
Respondent: Ian Whitlock <whitloi1@westat.com>
A.A. McCulloch <aam22@CUS.CAM.AC.UK> asks an interesting "sets" type
of SQL question. We used the UNION operator and the EXCEPT operator.
The example is simple enough to follow yet rich enough to be
informative.
>I have three data files which contain the dates when people started
>and left education of three different types (school, college,
>university). I would like to calculate the first year (>16th birth)
>when each person was not in any sort of education. I've been trying
>to use SQL but don't know where I should start. Below is data giving
>16th birthday and start and end dates of the three classes of
>education.
The first thing to realize is that SQL must have available appropriate
years. (When you ask for something that is not an orange you must have
in mind a universe of possibilities from which to choose.) I took the
simplest route and said the possible years range from 16th birthday
plus 1 to 95 since all the dates seemed to end in 1995 or before. (At
the risk of complexity one might choose to narrow the list of years.)
Here is the code. The code to generate Andrews data is given at the
bottom.
/* generate list of relevant years */
data years ( keep = id yr ) ;
set b16 ;
by id ;
do yr = birth16 + 1 to 95 ;
output ;
end ;
run ;
Now we are prepared to think about SQL. First we want to eliminate
years in an institution and then pick the minimum remaining year. How
do we get the years in an institution? Match the possible years
against records for a type of institution and grab any year within any
records limit. Note that the same year may be listed more than once
for an institution or even for more than one institution. We don't
care. The we take the union of the three types of years. This is the
view V created in the first step. Then we eliminate these years from
consideration and take the minimum of the remaining to provide an
answer when there is one. This is view W created in the second step.
Finally we need to do a left join with the IDs from B16 to pick up
anyone who had all their years eliminated. (One could of course
eliminate this last step by adding enough relevant years to the YEARS
data set.)
Here is the SQL code.
proc sql ;
/* list years in any educational institution */
create view v as
select id , yr
from
( select *
from ( select distinct m.id , m.yr
from years as m , school
where m.id = school.id and
m.yr between strt and end )
)
union corresponding
( select *
from ( select distinct m.id , m.yr
from years as m , college
where m.id = college.id and
m.yr between strt and end )
)
union corresponding
( select *
from ( select distinct m.id , m.yr
from years as m , univ
where m.id = univ.id and
m.yr between strt and end )
)
;
/* get the minimum relevant year not in an educational */
/* institution */
create view w as
select id , min ( yr ) as yr
from
( select id , yr from years
except corresponding
select id , yr from v
)
group by id
order by id
;
title
"First year after 16th birthday out of educational system" ;
select b.id , w.yr
from b16 as b left join w
on b.id = w.id
;
quit ;
Originally I tried to intersect the years out of each type of
institution. This approach did not work since there may be multiple
records for one ID in a type of institution.
The code to generate the data sets referred to above is:
/* test data */
data b16 ; input
id birth16 ; cards ;
11 86
12 84
13 47
14 65
15 90
16 64
17 69
18 72
19 81
20 82
;
data school ; input
ID STRT END ; cards ;
11 84 86
11 88 90
12 82 83
13 45 55
14 85 87
15 88 89
16 62 72
16 77 87
17 67 68
18 83 86
;
data college ; input
ID STRT END ; cards ;
11 90 91
12 84 85
13 93 93
14 64 65
14 88 90
15 94 95
16 93 94
17 83 85
18 88 90
19 89 92
;
data univ ; input
ID STRT END ; cards ;
11 87 93
12 91 94
13 46 47
14 65 69
15 90 93
16 92 94
17 79 83
18 71 74
19 80 83
20 81 84
;
I guess there wasn't a place for this type of question on SAS-L in the
"old days" because the then proper response "RTFM" would have been too
dull to hold interest.
Ian Whitlock