```Date: Wed, 2 Jul 2008 09:11:11 -0500 Reply-To: Mary Sender: "SAS(r) Discussion" From: Mary Subject: Re: Update info in data set based on other data set Comments: To: Jan Sunde Content-Type: text/plain; charset="iso-8859-1" Jan, Could you please post your code that you actually tried? I actually gave two different solutions yesterday; perhaps try the one that you didn't try before? The one late in the day yesterday (with the UNION) was tested. -Mary ----- Original Message ----- From: Jan Sunde To: SAS-L@LISTSERV.UGA.EDU Sent: Wednesday, July 02, 2008 5:47 AM Subject: Re: Update info in data set based on other data set Thank you everyone for your help. I really appreciate your effort and your elaborate answers. However, I am not quite there yet. The Cartesian product was obviously the problem here - my bad! I am a novice to "SQL thinking" in SAS and still learning. However, neither Mary's and Ian's solutions are sufficient, although elegant. This could be just as well due to my lack of description of the data sets involved, so I will give a more thorough description: Dataset A has ids and dates that, in combination, are unique identifiers. This combination may, or may not, appear in dataset B, which only contains repeated measures. In case the unique combination date+id is found in B, the variable varx ("not repeated") in set A is set to 0. Otherwise, varx=1. Dates or ids from A may however appear in B in -other- combinations, but are then not counted as repeated measures. No observations in A or B have NULL values for date and id. I want dataset C to have all observations from dataset A + the new variable varx. There are 509759 observations in dataset A, and 108385 observations in B. This gives a proportion of repeats of approx. 21.2% which is the expected count for varx=1. Mary's solution gives the right number of observations in dataset C, namely 509759. However, all varx are set to 1. Similarly, Ian's solution also gives the correct number of observations but varx in this case is set to 0 for all observations. Clearly, something is still wrong, but I can't see what.... On Tue, 1 Jul 2008 21:32:04 +0000, Ian Whitlock wrote: >Summary: Finer points of cartesian joins >#iw-value=1 > >Jan, > >Mary is correct about your mistake. Suppose both A and B have >a modest hundred thousand observations. Then the Cartesian >product has 10 billion obs to create and check so it is very >easy to run out of resources even with modest amounts of data. > >Now I don't know about your data and whether ID/DATE is unique, but >Mary's code will generate extra records although it is unlikely to >run out of resources. > >Consider: > > data a ; > do obs = 1 to 20 ; > date = int("1jul2008"d + 10 * (ranuni(945543)-.5)) ; > if ranuni(0) < .5 then id = 1 ; > else id = 2 ; > output ; > end ; > format date date9. ; > run ; > > data b ; > do date = "1jul2008"d - 3, "1jul2008"d +1 , "1jul2008"d + 3 ; > id = 1 ; > output ; > id = 2 ; > output ; > output ; > end ; > run ; > >Instead of a CASE clause I prefer a simpler boolean expression > > ( a.date=q.date and a.id = q.id ) as flag > >However that is really a stylistic choice. Now the log > >614 proc sql ; >615 create table updates as >616 select a.* , ( a.date=q.date and a.id = q.id ) as flag >617 from a left join (select /* DISTINCT */ date, id from b) as q >618 on a.date = q.date and a.id = q.id >619 ; >NOTE: Table WORK.UPDATES created, with 25 rows and 4 columns. > >620 create table updates as >621 select a.* , ( a.date=q.date and a.id = q.id ) as flag >622 from a left join (select DISTINCT date, id from b) as q >623 on a.date = q.date and a.id = q.id >624 ; >NOTE: Table WORK.UPDATES created, with 20 rows and 4 columns. > >shows in the first case 5 observations were manufactured. In other >words, the DISTINCT modifier on selecting from B is necessary unless >you know there are no cases where ID and DATE match in B. > >Ian Whitlock >================ > >Date: Tue, 1 Jul 2008 09:07:37 -0500 >Reply-To: Mary >Sender: "SAS(r) Discussion" >From: Mary >Subject: Re: Update info in data set based on other data set >Comments: To: Jan Sunde >Content-Type: text/plain; charset="iso-8859-1" > >Jan, > >The problem is that you are trying to do a massive join of the two >tables- you have nothing like > >select a.*,b.* >from a >join >b >on a.id=b.id > >So since you have no on statement indicating how to join the two >tables, you are doing a cross-join of all records by all records, >which is not what you want anyway. > >A left outer join might accomplish this: > >proc sql ; >create table c as >select a.*, >case when a.id=b.id and a.date=b.date > then 0 > else 1 >end as varx >from datasetA as a, >left outer join >datasetB as b >on a.id=b.id and a.date=b.date; >quit; > >Here the date and/or id should be null if it doesn't find the record >in table b, so I think this might work (not tested). > >-Mary > > > > > ----- Original Message ----- > From: Jan Sunde > To: SAS-L > Sent: Tuesday, July 01, 2008 8:27 AM > Subject: Update info in data set based on other data set > > > Hi > I have a problem as follows : > I want to create a dataset C that copies all of data set A > as well as adds an additional variable (varx) as 0/1 based on > whether an > observation id in dataset A occurs in dataset B (on the same date). > I have tried a PROC SQL as follows but with no success: > > proc sql ; > create table c as > select a.*, > case when a.id=b.id and a.date=b.date > then 0 > else 1 > end as varx > from datasetA as a, datasetB as b; > quit; > > I get a timeout (OUT OF RESOURCES error). > Any idea on what I am doing wrong here? ```

Back to: Top of message | Previous page | Main SAS-L page