Date: Thu, 5 Aug 1999 12:32:35 -0400
Reply-To: Peter Flom <peter.flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peter.flom@NDRI.ORG>
Subject: Re: Real stats on real big data?
Content-Type: text/plain; charset=US-ASCII
>>> "Berryhill, Tim" <TWB2@PGE.COM> 08/05/99 11:50AM >>> wrote
>>>Let me start by saying I haven't been paid for statistical work for 15 years
>>>or more, so take this with some skepticism.
>>>3) A curious thing I have noticed with large datasets, which perhaps argues
>>>in favor of samples, is that with 20 M obs every difference is significant.
>>>I expect this is based on my incorrect application of statistics--assuming a
>>>distribution is normal when in fact there is a minimum and such. It wasn't
>>>a problem back in Oregon when we had N's of 17 or 288.
My reply
It is true that everything is significant with a really large N, but this is not because
of any incorrect application of statistics; it is inherent in the process. As N gets
larger, you get more precise estimates of the population parameters, so you are able to
detect smaller effects.
So, if you are doing (say) a t-test, you will be able to detect very small differences
between means. Since the means of the two populations are never EXACTLY
equal, with large enough N you will always find a statistically significant
difference between two samples.
Whether that difference is meaningful for any practical purpose is another matter.
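
To make the point concrete, here is a minimal sketch in Python (not SAS, and not
anything from the original exchange). It assumes a fixed true difference of 0.01
between two means, a common standard deviation of 1, and equal group sizes; those
numbers and the function name two_sided_p_value are made up for illustration. It also
uses the large-sample normal approximation rather than an exact t distribution, which
is rough at N = 17 but fine for showing the trend. The standard error shrinks like
1/sqrt(N), so the same tiny difference goes from invisible to overwhelmingly
"significant" as N grows.

```python
import math


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Return (z, p) for a two-sample comparison of means.

    Uses the normal approximation; all inputs are illustrative assumptions.
    """
    # Standard error of the difference between two group means,
    # assuming equal group sizes and a common standard deviation.
    se = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / se
    # Two-sided tail probability under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Group sizes echo the N's mentioned in the thread (17, 288, 20 M),
    # treated here as per-group sizes purely for illustration.
    for n in (17, 288, 20_000_000):
        z, p = two_sided_p_value(mean_diff=0.01, sd=1.0, n_per_group=n)
        print(f"N per group = {n:>10,}  z = {z:8.3f}  p = {p:.3g}")
```

Under these assumptions the p-value stays far above 0.05 at N = 17 and N = 288 and
is essentially zero at N = 20 million, even though the difference itself (one
hundredth of a standard deviation) has not changed.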
Peter Flom, Ph.D.
Principal Research Associate
NDRI
2 World Trade Center
16th floor
New York, NY 10048
(212) 845-4485 (voice)
(212) 845-4698 (fax)
Peter.Flom@ndri.org