LISTSERV at the University of Georgia
Date:   Tue, 29 Nov 2005 09:01:30 -0500
Reply-To:   Peter Flom <flom@NDRI.ORG>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Peter Flom <flom@NDRI.ORG>
Subject:   Re: Data Cleaning Books (may be OT)
Comments:   To: Curt Seeliger <Seeliger.Curt@EPAMAIL.EPA.GOV>
In-Reply-To:   <OF98ADF5BA.FB026CF5-ON882570C7.007E887B-882570C8.00028430@epamail.epa.gov>
Content-Type:   text/plain; charset=US-ASCII

Peter >>"Yes, 42 inches. She's a midget OK?"

David > You can often address that with multivariate methods. After all, that 42" height probably goes with a pretty low weight, and a pretty tiny shoe size, and...

Curt > <<< But (and I don't offhand know how big this 'but' is), you could inject bias in the data by only questioning values which don't match an a priori hypothesis. It seems it would pay to know in advance what relationships you are testing, and to base statistical tests for incorrect values on this knowledge. >>>

You certainly could. In the extreme case, you could reject every point that doesn't fit your hypothesis, and thereby confirm your hypothesis, whatever the data say. :-)

But evaluating how much bias you introduce by merely questioning, rather than outright rejecting, cases that are unusual or that violate your hypothesis is very tricky. It might be interesting to try simulating some of the simpler possibilities.

Suppose, for example, you were only concerned with random data entry errors. Suppose you figured that certain errors are likely (e.g. reversing digits, leaving off a digit, adding a digit). You could then add that noise to a data set, then look at the data set again for 'outliers' and see what happens.
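A rough sketch of that kind of simulation, in Python rather than SAS, might look like the following. Everything in it is an assumption made for illustration: the function names (`corrupt`, `flag_outliers`, `simulate`), the clean data (heights in inches drawn from a normal distribution with mean 66 and SD 4), the 2% error rate, and the outlier rule (a median/MAD robust z-score with a cutoff of 3.5). It is one simple way to set the experiment up, not an established procedure.

```python
import random
import statistics


def corrupt(value, rng):
    """Apply one random data-entry error to a non-negative integer.

    The three error types are the ones named in the post: reversing
    two adjacent digits, leaving off a digit, and adding a digit.
    """
    digits = str(value)
    kind = rng.choice(["swap", "drop", "add"])
    if kind == "swap" and len(digits) >= 2:
        i = rng.randrange(len(digits) - 1)
        digits = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
    elif kind == "drop" and len(digits) >= 2:
        i = rng.randrange(len(digits))
        digits = digits[:i] + digits[i + 1:]
    else:
        # Insert a random digit (also the fallback for one-digit values).
        i = rng.randrange(len(digits) + 1)
        digits = digits[:i] + str(rng.randrange(10)) + digits[i:]
    return int(digits)


def flag_outliers(values, cutoff=3.5):
    """Flag values whose median/MAD robust z-score exceeds the cutoff."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]


def simulate(n=1000, error_rate=0.02, seed=2005):
    """Inject entry errors into clean data, then count what gets flagged."""
    rng = random.Random(seed)
    clean = [round(rng.gauss(66, 4)) for _ in range(n)]  # assumed heights
    noisy = list(clean)
    changed = set()
    for i in range(n):
        if rng.random() < error_rate:
            noisy[i] = corrupt(clean[i], rng)
            # Some "errors" leave the value unchanged (e.g. 66 -> 66).
            if noisy[i] != clean[i]:
                changed.add(i)
    flags = flag_outliers(noisy)
    caught = sum(1 for i in changed if flags[i])
    false_alarms = sum(1 for i, f in enumerate(flags) if f and i not in changed)
    return {
        "errors": len(changed),
        "caught": caught,
        "missed": len(changed) - caught,
        "false_alarms": false_alarms,
        "mean_clean": statistics.mean(clean),
        "mean_noisy": statistics.mean(noisy),
    }


if __name__ == "__main__":
    print(simulate())
```

Comparing `mean_clean` with `mean_noisy`, and `caught` with `missed`, gives a first look at how much damage the errors do and how many of them an outlier screen would actually question. Errors that land inside the plausible range (68 entered as 86 would be flagged, but 67 entered as 76 might not be) are the ones such a screen is likely to miss.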

Hmm, does anyone know whether this sort of simulation has been done?

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
http://cduhr.ndri.org
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)

