I'm not actually working with 100,000 SNPs as Ron is attempting to do; I've got 3000 within 300 genes. But even so, what is the alternative to running these against the disease? It seems that this is what the genetics people are doing. I don't have a genetics background, but I do see the problem: DNA mapping is so new that it is difficult to know which genes are most likely to be significant. My manager does have definite theories about which genes ought to be significant (and he's usually right!), but how would someone go about confirming a theory without running all genes to see whether only the ones in the theory are significant and the others are not?
Some suggestions as to alternative approaches would be helpful.
-Mary
----- Original Message -----
From: Sigurd Hermansen
To: SAS-L@LISTSERV.UGA.EDU
Sent: Friday, December 21, 2007 10:16 AM
Subject: Re: Performing thousands of tests automatically
Even an analyst with my limited understanding of genetics has to wonder
about the value of what one would find after multiples of multiple
comparisons. It seems to me to border on bad science. I do more than a
little data mining, but I have to wonder about anecdotal evidence of
successes of blind searches for statistical significance.
Ordinary least squares regressions fit a highly restricted model and,
when dealing with single predictors such as an SNP, may not fit data
well enough by chance to reach a usual standard of statistical
significance. More flexible methods of fitting models often overfit
noise. For example, a simple model estimated in PROC MIXED using 100
random values of y and 1000 sets of 100 random values of x (groups
j=1 ... 1000),
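/* Suppress printed procedure output and write the log to a file */
proc printto Print=_null_ log="H:\MY DOCUMENTS\SASPrograms\MixedStat.log";
run;
ods listing body="H:\MY DOCUMENTS\SASPrograms\MixedStat.txt";
/* Fit y=x separately within each of the 1000 groups j */
proc mixed data=test noclprint noinfo noitprint noprofile;
   model y=x;
   random intercept;
   by j;
   /* Capture the Type 3 F-tests for every group */
   ods output Tests3=SigTest;
run;
ods listing close;
/* Keep only the groups whose F-test is significant at the 0.05 level */
proc sql;
   create table SigTest as
   select * from SigTest where probF<0.05
   ;
quit;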
generated 54 instances of estimates with F-tests significant at the
0.05 level. At that rate, 100,000 groups would likely produce about
5,400 statistically significant predictions of y given x.
The source of data used in PROC MIXED ....?
data depvar (keep=i y);
   do i=1 to 100;
      y=round(ranuni(11131),1.);
      output;
   end;
run;
data test;
   /* Declare hash object and read data set as ordered */
   if _N_ = 1 then do;
      length y 3.;
      declare hash h(hashexp: 4, dataset: 'depvar', ordered: 'yes');
      declare hiter iter('h');
      /* Define key and data variables */
      h.defineKey('i');
      h.defineData('y');
      h.defineDone();
      /* avoid uninitialized variable notes */
      call missing(i,y);
   end;
   /* Iterate through the hash object and output data values */
   do j=1 to 1e3;
      rc = iter.first();
      do while (rc = 0);
         x=round(ranuni(23171),1.);
         z=round(ranuni(12317),1.);
         output;
         rc = iter.next();
      end;
   end;
run;
.... selected at random.
I've provided the program so you can verify correctness and test it,
change it, etc. It seems to me that a test of any model on random
observations should set a minimum standard for model uncertainty.
PS
David, while we don't pretend to speak for you or claim that you will
agree with everything that we say, we are making an effort to continue a
cause that you represented so well. Hope that you are doing well now and
enjoying the holidays.
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Mary
Sent: Tuesday, December 18, 2007 3:43 PM
To: Ron Do; SAS-L@LISTSERV.UGA.EDU
Subject: Re: Re: Performing thousands of tests automatically
Ron,
Yes, Bonferroni was used in the article that I cited that identified
SNPs in the CFH gene as being highly related to macular degeneration.
While that article may have been a fishing trip, its findings were
later confirmed by scientific theory and replication. My manager
(Dr. Greg Hageman, PhD), who discovered the CFH gene link to macular
degeneration, just formed a company (Optherion; though I work for the
University, not the company). The startup investment came in a few
months ago at **** 35 million dollars ****; not too bad for a
fishing trip :-).
But doing a run-through of all SNPs can only be thought of as a first
pass on the data; theory and verification must follow, and the results
that come out of such runs just give hints as to where the true
associations might actually be.
-Mary
----- Original Message -----
From: Ron Do
To: SAS-L@LISTSERV.UGA.EDU
Sent: Tuesday, December 18, 2007 2:28 PM
Subject: Re: Performing thousands of tests automatically
Bonferroni is used a lot in these instances to account for multiple
testing.
An important thing for these studies is replication of the association
results in another independent sample.
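A minimal sketch of a Bonferroni adjustment in SAS follows. It assumes the unfiltered p-values from all tests sit in one data set with the raw p-value in a variable named ProbF. The data set names AllTests and Adjusted are hypothetical, and the uniform random p-values merely stand in for real test results where no association exists.

```sas
/* Stand-in input: 1000 p-values drawn uniformly, as expected when   */
/* every null hypothesis is true. In practice AllTests would hold    */
/* the unfiltered p-values from the real tests.                      */
data AllTests;
   do j = 1 to 1000;
      ProbF = ranuni(11131);
      output;
   end;
run;

/* Bonferroni-adjust the raw p-values; adjusted values go to bon_p */
proc multtest inpvalues(ProbF)=AllTests bon out=Adjusted noprint;
run;

/* List only the tests that survive the correction */
proc print data=Adjusted;
   where bon_p < 0.05;
run;
```

With 1000 tests, the Bonferroni adjustment amounts to requiring a raw p-value below 0.05/1000 = 0.00005, so noise alone should rarely pass. Replication in an independent sample remains a separate step.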