Date: Sun, 14 Mar 1999 13:09:37 -0500
Reply-To: hermans1 <hermans1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: hermans1 <hermans1@WESTAT.COM>
Subject: Re: SQL: One more related question: SQL question Cartesian P
Content-Type: text/plain; charset=US-ASCII
Using the collating sequence of the gene variable, try
proc sql;
create table data3 as
select
d1.gene1 as gene1,d1.marke1 as marke1,d1.alle1 as alle1,
d2.gene1 as gene2,d2.marke1 as marke2,d2.alle1 as alle2
from data1 d1,data1 d2
where d1.marke1=d2.marke1
and d1.gene1<d2.gene1
order by marke1,gene2
;
quit;
Note that I am using a reflexive join on one table rather than copying
the table prior to the join. I believe that the solution matches what
you want, except perhaps the order of the last couple of rows. Sig
______________________________ Reply Separator _________________________________
Subject: SQL: One more related question: SQL question Cartesian Produ
Author: "Seltzer; Jon D." <SeltzerJD@PHIBRED.COM> at Internet-E-Mail
Date: 3/14/99 1:39 AM
First my thanks to everyone that responded to my original question, you are
all helping me get out of a monster programming jam.
All of the solutions were orders of magnitude more efficient than what I was
trying to do. Hopefully I am on my way to writing a program that takes
1/2hour to run Vs 40 hours to run.
Below is my original question.
In working on this problem a bit more, I suddenly realized
in reality I want to get only one record from a representative pair from the
output at the bottom
of this email. I thought all pairs would be okay but now I realize that is
not the case.
Once again any help would be greatly appreciated.
So now the final data set I would like to obtain is:
gene1 mark1 alle1 gene2 mark2 alle2
A 1 .1 B 1 .4
A 1 .1 C 1 .7
B 1 .4 C 1 .7
A 2 .2 B 2 .5
A 2 .2 C 2 .8
B 2 .5 C 2 .8
A 3 .3 B 3 .6
A 3 .3 C 3 .9
B 3 .6 C 3 .9
If you compare this to the past final data set below you will notice that
the second record has been deleted in each pair of records. I really have
no clue how to do
this without lots of data manipulation. The actual file I am working on is
2-4 million records,
which is why I am looking for more efficient code.
The original problem.
Suppose data1 is:
gene1 marke1 alle1
A 1 .1
A 2 .2
A 3 .3
B 1 .4
B 2 .5
B 3 .6
C 1 .7
C 2 .8
C 3 .9
Data2 is the same as data1 but with different variable names.
gene2 marke2 alle2
A 1 .1
A 2 .2
A 3 .3
B 1 .4
B 2 .5
B 3 .6
C 1 .7
C 2 .8
C 3 .9
What I want to create is data3: For each marker, I want all pairs of gene1
and gene2 excluding the case when gene1=gene2. I would also be happy with
just the unique pairs if that is easier to do.
gene1 mark1 alle1 gene2 mark2 alle2
A 1 .1 B 1 .4
B 1 .4 A 1 .1
A 1 .1 C 1 .7
C 1 .7 A 1 .1
B 1 .4 C 1 .7
C 1 .7 B 1 .4
A 2 .2 B 2 .5
B 2 .5 A 2 .2
A 2 .2 C 2 .8
C 2 .8 A 2 .2
B 2 .5 C 2 .8
C 2 .8 B 2 .5
A 3 .3 B 3 .6
B 3 .6 A 3 .3
A 3 .3 C 3 .9
C 3 .9 A 3 .3
B 3 .6 C 3 .9
C 3 .9 B 3 .6