LISTSERV at the University of Georgia
Date:         Mon, 19 Sep 2005 14:47:16 -0400
Reply-To:     Richard Ristow <wrristow@mindspring.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Richard Ristow <wrristow@mindspring.com>
Subject:      Tracing data corrections, and "data in code"
In-Reply-To:  <s329cfd7.054@GWMAIL01.LOYOLA.EDU>
Content-Type: text/plain; charset="us-ascii"; format=flowed

For accuracy, even for honesty, EVERY data value in EVERY file must be traceable, as in: "data is read by Yield.SPS from file Wheat_Project_Yield.DAT, on CD-ROM 'Project data 06/13/2005'". (DOCUMENT is good for recording this. All of us, including me, under-use DOCUMENT.)
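A minimal sketch of recording that kind of provenance with DOCUMENT, at the point where the raw file is read. The column positions, the variable list, and the unqualified file names are assumptions made for illustration; they are not the real layout of Wheat_Project_Yield.DAT.

```spss
* Sketch of Yield.SPS; column positions and variable list are assumed.
DATA LIST FILE='Wheat_Project_Yield.DAT' FIXED
   /ID 1-8 VAR017 10-17.
DOCUMENT   Data read by Yield.SPS from file
           Wheat_Project_Yield.DAT, on CD-ROM
           labeled Project data 06/13/2005.
SAVE OUTFILE='Yield.SAV'.
* List the notes now stored with the active file.
DISPLAY DOCUMENTS.
```

The note is saved in Yield.SAV along with the data, so any later program that GETs the file can show it with DISPLAY DOCUMENTS.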

That means that, once values are read, no transformation program should change them. By all means use transformations; but always create new variables, never modify old ones. Don't even modify TRANSFORMED variables. You don't want the same variable name to mean different things, especially not subtly different things, in different files. Among other confusions, it makes tracing back horrendously difficult, even if it's still possible.
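As a small illustration of "create new variables, never modify old ones": the sketch below leaves VAR017 as read and puts the transformed value in a new variable. The name LNVAR017 and the choice of a log transformation are hypothetical, picked only to show the pattern.

```spss
* Avoid overwriting the source variable, as in COMPUTE VAR017 = LN(VAR017).
* Instead, leave VAR017 untouched and derive a new variable.
* LNVAR017 is a hypothetical name; the log is only an example.
COMPUTE LNVAR017 = LN(VAR017).
VARIABLE LABELS LNVAR017 'Natural log of VAR017 (computed by this program)'.
EXECUTE.
```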

This came up recently. (Martin Sherman, sorry to use you for an example.) It's a followup to thread "Re: converting a missing time value into a real time value", earlier this month. From an off-list exchange:

>>At 03:52 PM 9/15/2005, [Martin Sherman] wrote:
>>
>>>Both of these work [to specify the value "9 hours"]. mfs
>>>if q1 eq 52724937 @55btime_ eq 9*3600.
>>>IF q1 eq 52724937 @55btime_ = TIME.HMS(9).
>>
>>[Richard asked,] WHERE did the number "52724937" come from?

At 07:47 PM 9/15/2005, Martin Sherman wrote:
>That number is an ID number. I was cleaning my data set and checking
>the times [respondents] provided when certain events transpired. [This
>one] had a value for time that didn't make too much sense. So I pull
>it out and requested that the RAs recheck the raw protocol. Lo and
>behold the value should have been 9:00.

OK. It's very easy to correct an erroneous value or so like that:

.  IF q1 eq 52724937 @55btime_ = TIME.HMS(9).

Or even a set of them:

.  IF (ID=123456) VAR017=3.14159.
.  IF (ID=234567) VAR121=1.41426.

I confess: I have been tempted, and I have sometimes fallen. Part of the temptation is, "I'm not changing it; I'm correcting it to what it should have been in the first place." Which is fine, if you're infallible.

This practice is "data in code," and can be very, very hard to trace. In the file you run statistics against, the value of VAR017 comes from the source data in cases 123455 and 123457, from your program in case 123456, and there's NO indication. Nobody will know it's been changed, unless they compare your file with the source data, and that's laborious. (HINT: it's much easier in SAS.) And if they suspect you've done "data in code" correction, they must search all your transformation programs to find where you've done it. And then, since you were in a hurry in the first place, instead of writing

*  Following value corrected from review of paper forms, 07/21/2005.
.  IF (ID=123456) VAR017=3.14159.

you probably just wrote

.  IF (ID=123456) VAR017=3.14159.

and all they know is, you thought that was a good idea.

(Data should be like dog shows: "You can't show it without a pedigree.")

There's no easy solution for a source file with a few needed corrections. List-members, what do you recommend, and do?

-> If you ask for a corrected source file, it'll take a long time, and the errors are likely to be just as bad, if different.
-> You could create a file of corrections, and use UPDATE.
-> Or, I might do "data in code" after all. But I'd treat it as data, to be just as documented and just as disciplined.

Something like this, and note the reference to Xerox copies of source documents. For variable CORR01, why value 9='Logic error'? To catch cases where,
.  The DO IF construct was not processed because a test returned MISSING
.  A DO IF or ELSE IF block for an ID value failed to set CORR01=1.

/* ......... YieldX.SPS .......... */
/*           19 Sep 2005           */
/* Corrections to Yield.SAV,       */
/* from Wheat_Project_Yield.DAT;   */
/* for source, see DOCUMENT in the */
/* input file.                     */

GET FILE=<path>\YIELD.SAV.
DOCUMENT   YieldX.SAV is a corrected
-          version of Yield.SAV; see its
-          DOCUMENT (which should precede
-          this note) for details.
-
-          Changed records are to have
-          variable CORR01=1.
-
-          For changes and justification,
-          see YieldX.SPS, 19 Sep 2005.

NUMERIC   CORR01 (F2).
VAR LABEL CORR01 'Case revised by YieldX.SPS, 19 Sep 2005'.
VAL LABEL CORR01 0 'Unchanged'
                 1 'Changed'
                 9 'Logic error'.
COMPUTE   CORR01 = 9.
DO IF   ID = 123456.
.  COMPUTE CORR01 = 1.
*  Initial value of VAR017 seemed implausible.
*  Corrected by inspection of original        .
*  data-capture form; Xerox attached to       .
*  listing of this program.                   .
.  COMPUTE VAR017=3.14159.
ELSE IF ID = 234567.
.  COMPUTE CORR01 = 1.
*  Letter from investigator 29 Aug 2005 gave  .
*  this correction (Xerox attached)           .
.  COMPUTE VAR121=1.41426.
ELSE.
.  COMPUTE CORR01 = 0.
END IF.

FREQUENCIES CORR01.
TEMPORARY.
SELECT IF CORR01 NE 0.
LIST ID CORR01.
SAVE OUTFILE=<path>\YIELDX.SAV.
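For comparison, the other option raised above, a file of corrections applied with UPDATE, might be sketched as follows. This is only an illustration under stated assumptions: the corrections-file name Yield_Corrections.SAV, the flag variable INCORR, and the unqualified file names are hypothetical, and the inline data simply repeat the two example corrections. UPDATE leaves a master value unchanged when the matching value in the transaction file is system-missing, so each correction record needs to fill in only the variable it changes.

```spss
* Hypothetical corrections file, keyed by ID.
* A lone period is read as system-missing, so that field is not updated.
DATA LIST LIST /ID VAR017 VAR121.
BEGIN DATA
123456 3.14159 .
234567 . 1.41426
END DATA.
SORT CASES BY ID.
SAVE OUTFILE='Yield_Corrections.SAV'.

* Apply the corrections; both files must be sorted by ID.
* INCORR (hypothetical name) is 1 for IDs found in the corrections file.
GET FILE='Yield.SAV'.
SORT CASES BY ID.
UPDATE FILE=*
   /FILE='Yield_Corrections.SAV' /IN=INCORR
   /BY ID.
VALUE LABELS INCORR 0 'Not in corrections file' 1 'In corrections file'.
TEMPORARY.
SELECT IF INCORR EQ 1.
LIST ID INCORR VAR017 VAR121.
SAVE OUTFILE='YieldX.SAV'.
```

The corrections then live in a data file that can carry its own DOCUMENT, rather than in transformation code. The reason for each correction still has to be recorded somewhere, for instance in an extra string variable in the corrections file.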

