| Date: | Tue, 29 Nov 2005 09:01:30 -0500 |
| Reply-To: | Peter Flom <flom@NDRI.ORG> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Peter Flom <flom@NDRI.ORG> |
| Subject: | Re: Data Cleaning Books (may be OT) |
| In-Reply-To: | <OF98ADF5BA.FB026CF5-ON882570C7.007E887B-882570C8.00028430@epamail.epa.gov> |
| Content-Type: | text/plain; charset=US-ASCII |
Peter >>"Yes, 42 inches. She's a midget OK?"
David > You can often address that with multivariate methods. After all, that
42" height probably goes with a pretty low weight, and a pretty tiny shoe
size, and...
Curt >
<<<
But (and I don't offhand know how big this 'but' is), you could inject
bias in the data by only questioning values which don't match an a
priori hypothesis. It seems it would pay to know in advance what
relationships you are testing, and to base statistical tests for
incorrect values on this knowledge.
>>>
You certainly could. In the extreme case, you could reject every point that doesn't fit
your hypothesis, and thereby confirm your hypothesis, whatever the data say. :-)
But evaluating how much bias you introduce by merely questioning, rather than outright
rejecting, cases that are unusual or that violate your hypothesis is very tricky. It might be interesting
to try simulating some of the simpler possibilities...
Suppose, for example, you were only concerned with random data entry errors. Suppose you
figured that certain errors were likely (e.g. reversing digits, leaving off a digit, adding a digit). You could then
add that noise to a clean data set, look at the data set again for 'outliers', and see how many of the
injected errors get flagged and how many correct values get flagged along with them.
...hmmm, does anyone know if such stuff has been done?
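Here is a minimal sketch of that kind of simulation, in Python. It assumes integer heights in inches drawn from a normal distribution (mean 67, SD 4), a 2% error rate, and a median/MAD rule with a cutoff of 3.5 for flagging 'outliers'. All of those numbers, and every function name, are made up for illustration; a real study would want error rates and a screening rule that match its own data entry process.

```python
import random
import statistics

def transpose_digits(value, rng):
    """Swap two adjacent digits, e.g. 67 -> 76."""
    s = str(value)
    if len(s) < 2:
        return value
    i = rng.randrange(len(s) - 1)
    return int(s[:i] + s[i + 1] + s[i] + s[i + 2:])

def drop_digit(value, rng):
    """Leave off one digit, e.g. 67 -> 6 or 7."""
    s = str(value)
    if len(s) < 2:
        return value
    i = rng.randrange(len(s))
    return int(s[:i] + s[i + 1:])

def add_digit(value, rng):
    """Insert one random digit, e.g. 67 -> 677."""
    s = str(value)
    i = rng.randrange(len(s) + 1)
    return int(s[:i] + str(rng.randrange(10)) + s[i:])

def simulate(n=1000, error_rate=0.02, seed=0):
    """Return (clean, noisy, corrupted) lists of length n.

    Heights are assumed normal(67, 4), rounded to whole inches (hypothetical).
    corrupted[i] is True only if the entry error actually changed the value;
    an error such as 66 -> 66 is invisible and counts as uncorrupted.
    """
    rng = random.Random(seed)
    errors = [transpose_digits, drop_digit, add_digit]
    clean, noisy, corrupted = [], [], []
    for _ in range(n):
        v = max(1, round(rng.gauss(67, 4)))  # keep values positive
        new = rng.choice(errors)(v, rng) if rng.random() < error_rate else v
        clean.append(v)
        noisy.append(new)
        corrupted.append(new != v)
    return clean, noisy, corrupted

def flag_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (median/MAD based) exceeds cutoff."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [v != med for v in values]
    scale = 1.4826 * mad  # MAD rescaled to estimate the SD under normality
    return [abs(v - med) / scale > cutoff for v in values]

def summarize(corrupted, flagged):
    """Count caught errors, missed errors, and correct values flagged."""
    pairs = list(zip(corrupted, flagged))
    return {
        "caught": sum(c and f for c, f in pairs),
        "missed": sum(c and not f for c, f in pairs),
        "false_alarms": sum(f and not c for c, f in pairs),
    }

if __name__ == "__main__":
    clean, noisy, corrupted = simulate()
    print(summarize(corrupted, flag_outliers(noisy)))
```

The "missed" count is the interesting one: a transposition such as 67 -> 76 stays well inside the plausible range, so no univariate screen will catch it, while dropped or added digits are usually obvious. Rerunning with different seeds and error rates would give a rough picture of how much each error type slips through.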
Peter
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
http://cduhr.ndri.org
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)