Date: Wed, 25 Jul 2007 15:09:53 -0500
Reply-To: "Hashmi, Syed S" <Syed.S.Hashmi.firstname.lastname@example.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Hashmi, Syed S" <Syed.S.Hashmi.email@example.com>
Subject: Re: Random date generator
Content-Type: text/plain; charset="us-ascii"
> -----Original Message-----
> From: Richard Ristow [mailto:firstname.lastname@example.org]
> Sent: Wednesday, July 25, 2007 2:05 PM
> >There is a small subset of the population where I have the complete
> >stop date but am missing the start day (I have the year and month) and
> >am also missing the duration. I had to come up with some way to
> >impute a start date for these cases for analysis. (which will be done
> >with and without these specific cases). I know that the event could
> >not be more than a month long. I was planning to calculate the earliest
> >possible start date (e_startdt) up to a month before the stop date
> >then randomly pick a date between e_startdt and the stop date.
> OUCH! I would not do this. Period.
> *MAYBE* the start dates and durations you get this way will be vaguely
> representative of the population of events, though I doubt it. Are
> durations roughly uniformly distributed from 0 to 30 days? For
> sake, you ought to check that before proceeding.
> But even if they're representative of the population, they have nothing
> to do with the individual cases for which they're 'imputed'. No
> analysis using those 'dates' will be in the least trustworthy.
> A far better approach is to use true missing-value interpolation on
> *durations*, not the dates. (See SPSS 'MVA'.) I'm not clear how many
> durations you'd have to impute. If it's near 50%, that won't be at all
> reliable, either.
> -Good luck,
Thanks for your input. I realize that I was stepping into extremely
treacherous territory when I decided to impute dates and select random
ones. As for the durations being roughly uniformly distributed, that's
what it looks like from the data I do have. Initially, I'd assumed that
durations would have a mean of about 7 days but somehow the data I do
have doesn't seem to show that. It's more or less uniformly
distributed. There were some durations that were >30 days but I doubt
if they're true. Therefore, I decided to go ahead with the uniform
distribution (although, the whole imputation and random selection still
The reason that I'm trying to get an idea about the dates, especially
the event start dates, is due to the nature of the study question. I'm
looking at the occurrence of certain events during pregnancy. However,
these events of interest have to occur within the first trimester, or if
I narrow it down further, the first two months of pregnancy. Therefore,
I have to know if an event occurred within a certain period of time
after the last menstrual date as reported by the woman. At the end of
the day, the variables for all the events get filtered down to a single
dichotomous variable - Y/N did the event occur during the period of
I will do the analysis with and without the cases where the dates have
been imputed from incomplete data. I hadn't previously thought of using
true missing-value interpolation on the durations but I'll look into it.
I've never done that before so will have to read up a bit on it. I
might have an issue with the number of missing values though, since more
cases have at least some part of the date than a duration value.
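
For illustration only, here is a minimal sketch of the random start-date
idea and the resulting Y/N flag, written in Python rather than SPSS syntax.
The function names, the 30-day cap on duration and the 60-day reading of
"the first two months" are all assumptions made for the example, not values
taken from the study. It restricts the draw to days that fall both within a
month before the stop date and inside the known start year and month.

```python
import calendar
import random
from datetime import date, timedelta


def impute_start_date(stop, start_year, start_month, max_days=30, rng=random):
    """Draw a start date uniformly from the days consistent with the known data.

    max_days is an assumed cap on event duration (hypothetical default).
    """
    # Earliest start allowed if the event lasted at most max_days.
    earliest = stop - timedelta(days=max_days)
    # The start year and month are known, so the day must lie in that month.
    month_first = date(start_year, start_month, 1)
    last_day = calendar.monthrange(start_year, start_month)[1]
    month_last = date(start_year, start_month, last_day)
    low = max(earliest, month_first)
    high = min(stop, month_last)
    if low > high:
        # The known month cannot be reconciled with the stop date and the cap.
        return None
    return low + timedelta(days=rng.randint(0, (high - low).days))


def in_window(start, lmp, window_days=60):
    """True if the event started within window_days after the last menstrual date.

    window_days=60 is an assumed stand-in for "the first two months".
    """
    return 0 <= (start - lmp).days <= window_days
```

A call such as impute_start_date(date(2007, 3, 10), 2007, 2) can only return
a day between 8 and 28 February 2007. Passing a seeded random.Random
instance as rng makes the draw reproducible, which helps when comparing the
analysis with and without the imputed cases.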
Thanks again for your advice. It's always nice to get a fresh look at an