Date: Wed, 25 Jul 2007 15:05:07 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: Random date generator
In-Reply-To: <820E5CA81E7A2544A9D7E0A25025E4810107AFCA@e2k0305.chestnut.net>
Content-Type: text/plain; charset="us-ascii"; format=flowed
Somehow I missed or deleted the original posting in this thread.
Anyway, on Thursday, July 19, 2007 9:48 PM Hashmi, Syed S asked,
>A dataset that I'm analyzing has a set of dates for events (start and
>stop dates) as well as how long those events occurred for. The data
>for each date is in three variables (month, day, year). The years are
>pretty complete if they are filled in but the month and day are
>sometimes listed as the exact month or date and other times they're
>listed as beginning, middle or end of the year (for the month
>variable) or the month (for the day variable).
>
>I have [two dates as three variables each, plus a duration].
>I have the complete start and stop date for about half the cases. The
>rest are missing either parts of one of the dates (e.g. day) or of
>both. If I have one of the dates and a duration, I can calculate the
>other date.
So far, so good, though be careful about how precise your 'durations'
are.
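To make the arithmetic concrete, here is a minimal sketch in Python rather than SPSS syntax. It assumes durations are whole days and that duration means stop date minus start date; the function names are hypothetical. If your durations are recorded more coarsely (say, in weeks), the derived date is only good to that precision.

```python
from datetime import date, timedelta

def start_from_stop(stop: date, duration_days: int) -> date:
    """Start date implied by a stop date and a duration in whole days.

    Assumes duration = stop - start (hypothetical convention).
    """
    return stop - timedelta(days=duration_days)

def stop_from_start(start: date, duration_days: int) -> date:
    """Stop date implied by a start date and a duration in whole days."""
    return start + timedelta(days=duration_days)

# Example: an event that stopped on 19 July 2007 after 10 days
print(start_from_stop(date(2007, 7, 19), 10))  # 2007-07-09
```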
>There is a small subset of the population where I have the complete
>stop date but am missing the start day (I have the year and month) and
>am also missing the duration. I had to come up with some way to
>impute a start date for these cases for analysis. (which will be done
>with and without these specific cases). I know that the event could
>not be more than a month long. I was planning to calculate the earliest
>possible start date (e_startdt) up to a month before the stop date and
>then randomly pick a date between e_startdt and the stop date.
OUCH! I would not do this. Period.
*MAYBE* the start dates and durations you get this way will be vaguely
representative of the population of events, though I doubt it. Are your
durations roughly uniformly distributed from 0 to 30 days? For goodness
sake, you ought to check that before proceeding.
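One way to run that check on the cases where the duration is known, sketched in Python rather than SPSS: compute the Kolmogorov-Smirnov distance between the observed durations and a uniform distribution on 0 to 30 days. The function name and the sample values are made up for illustration. Durations in whole days produce ties, so treat the result as a rough guide, not a formal test.

```python
def ks_uniform(durations, low=0.0, high=30.0):
    """Kolmogorov-Smirnov distance between a sample and Uniform(low, high).

    Returns the largest gap between the empirical CDF of the sample
    and the uniform CDF; 0 means a perfect match, 1 the worst possible.
    """
    xs = sorted(durations)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Uniform CDF at x, clipped to [0, 1]
        cdf = min(max((x - low) / (high - low), 0.0), 1.0)
        # Gap just after and just before the jump of the empirical CDF at x
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Made-up durations (days) from complete cases, clustered at short events
known = [1, 2, 2, 3, 3, 3, 4, 5, 7, 30]
d = ks_uniform(known)
# Usual large-sample 5% cutoff, about 1.36 / sqrt(n)
print(d, 1.36 / len(known) ** 0.5)
```

A distance well above the cutoff says the known durations are not close to uniform, in which case drawing start dates uniformly over the month would misrepresent even the population of events.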
But even if they're representative of the population, they have nothing
to do with the individual cases for which they're 'imputed'. No
analysis using those 'dates' will be in the least trustworthy.
A far better approach is to use true missing-value imputation on the
*durations*, not the dates. (See SPSS 'MVA'.) I'm not clear how many
durations you'd have to impute. If it's near 50%, that won't be at all
reliable, either.
-Good luck,
Richard