Even an analyst with my limited understanding of genetics has to wonder
about the value of what one would find after multiples of multiple
comparisons. It seems to me to border on bad science. I do more than a
little data mining, but I have to wonder about anecdotal evidence of
successes of blind searches for statistical significance.
Ordinary least squares regressions fit a highly restricted model and,
when dealing with single predictors such as an SNP, may not fit data
well enough by chance to reach a usual standard of statistical
significance. More flexible methods of fitting models often overfit
noise. For example, a simple model estimated in PROC MIXED using one
series of 100 values of y and 1000 sets of 100 values of x
(groups j=1 ... 1000),
proc printto Print=_null_ log="H:\MY
ods listing body="H:\MY DOCUMENTS\SASPrograms\MixedStat.txt";
proc mixed data=test noclprint noinfo noitprint noprofile;
ods output Tests3=SigTest;
ods listing close;
proc sql;
create table SigTest as
select * from SigTest where probF<0.05;
quit;
produced 54 estimates with F-tests significant at the 0.05 level. At
that rate, 100,000 groups would likely produce about 5,400
statistically significant predictions of y given x.
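For scale, here is a rough sketch of the arithmetic behind that count, together with the per-test cutoff a Bonferroni correction would impose. The variable names are my own and purely illustrative; nothing here comes from the program above.

```sas
/* Illustrative arithmetic only; all variable names are hypothetical. */
data _null_;
  alpha   = 0.05;                        /* nominal per-test level        */
  n_tests = 1000;                        /* one test per group j          */
  expected_false_pos = alpha * n_tests;  /* 50 expected under a true null */
  bonferroni_cutoff  = alpha / n_tests;  /* 0.00005 required per test     */
  put expected_false_pos= bonferroni_cutoff=;
run;
```

Seen this way, 54 hits out of 1000 is about what chance alone would give, and few if any of them would be expected to survive the Bonferroni cutoff.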
The source of data used in PROC MIXED ....?
data depvar (keep=i y);
do i=1 to 100;
/* Declare hash object and read data set as ordered */
if _N_ = 1 then do;
length y 3.;
declare hash h(hashexp: 4, dataset: 'depvar', ordered: 'yes');
declare hiter iter('h');
/* Define key and data variables */
/* avoid uninitialized variable notes */
/* Iterate through the hash object and output data values */
do j=1 to 1e3;
rc = iter.first();
do while (rc = 0);
rc = iter.next();
.... selected at random.
I've provided the program so you can verify correctness and test it,
change it, etc. It seems to me that a test of any model on random
observations should set a minimum standard for model uncertainty.
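As a minimal sketch of that kind of baseline, the steps below simulate pure noise and count how many single-predictor fits come out "significant". This is not the PROC MIXED program above, whose MODEL statement is not shown; I have swapped in PROC REG with a BY statement as a simpler stand-in, and the data set names (nulltest, pe), the seed, and the choice of normal random values are all assumptions made for illustration.

```sas
/* Hypothetical null simulation: y and x are unrelated by construction. */
/* Data set names (nulltest, pe) and the seed are illustrative only.    */
data nulltest;
  call streaminit(20071218);          /* fixed seed for repeatability   */
  array yv{100} _temporary_;
  do i = 1 to 100;                    /* one fixed series of 100 y      */
    yv{i} = rand('normal');
  end;
  do j = 1 to 1000;                   /* 1000 independent groups of x   */
    do i = 1 to 100;
      y = yv{i};
      x = rand('normal');
      output;
    end;
  end;
run;

ods exclude all;                      /* suppress the printed output    */
proc reg data=nulltest plots=none;
  by j;
  model y = x;
  ods output ParameterEstimates=pe;   /* keep the slope tests only      */
run;
quit;
ods exclude none;

/* Count slopes that reach p < 0.05 purely by chance */
proc sql;
  select count(*) as n_significant
  from pe
  where upcase(Variable) = 'X' and Probt < 0.05;
quit;
```

With nothing but noise in the data, the count should land near 50 on average, which is the floor that any search over real predictors would have to beat.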
David, while we don't pretend to speak for you or claim that you will
agree with everything that we say, we are making an effort to continue a
cause that you represented so well. Hope that you are doing well now and
enjoying the holidays.
From: firstname.lastname@example.org [mailto:email@example.com]
On Behalf Of Mary
Sent: Tuesday, December 18, 2007 3:43 PM
To: Ron Do; SAS-L@LISTSERV.UGA.EDU
Subject: Re: Re: Performing thousands of tests automatically
Yes, Bonferroni was used in the article that I cited that identified
SNPs in the CFH gene as being highly related to macular degeneration.
While that article may have been a fishing trip, it was later shown to
be correct by scientific theory and replication. My
manager (Dr. Greg Hageman, PhD), who discovered the CFH gene link to
macular degeneration, just formed a company (Optherion; though I
work for the University, not the company). The startup investment came
in a few months ago at **** 35 million dollars ****; not too bad for a
fishing trip :-).
But doing a run-through of all SNPs can only be thought of as a first
pass on the data; theory and verification must follow, and the results
that come out of such runs just give hints as to where the true
associations might actually be.
----- Original Message -----
From: Ron Do
Sent: Tuesday, December 18, 2007 2:28 PM
Subject: Re: Performing thousands of tests automatically
Bonferroni is used a lot in these instances to account for multiple
testing. An important thing for these studies is replication of the
results in another independent sample.