Date: Tue, 18 May 1999 09:49:28 -0400
Reply-To: Nick Vaidya <nick_vaidya@MCKENNA-GROUP.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Nick Vaidya <nick_vaidya@MCKENNA-GROUP.COM>
Subject: How to use Zip Codes as predictor with limited DF
Content-Type: text/plain; charset=us-ascii
A friend of mine offered a solution for using ZIP codes as dummy predictor
variables in a regression model (in this case logistic) when we do not have
enough degrees of freedom. I am not certain that the solution is a viable
one. What is your opinion on the merits and demerits of the approach? In
particular, would you use any other approach to solve the problem?
The Approach:
Let us say we are trying to predict who is likely to default on a loan and
believe that ZIP codes are important predictors. Since we do not have a huge
sample size, the number of degrees of freedom is limited. My friend suggests
that we convert the ZIP code from a discrete variable into a continuous one.
His approach is to take the default penetration rate in each ZIP code and use
that instead of the ZIP code. He computes it excluding the observation
concerned from the calculation. Thus, if there were 100 cases in a ZIP
code and
there were 20 defaulters, then the penetration rate for each defaulter is
19/99 and for each non-defaulter is 20/99.
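
A minimal sketch of the calculation described above, written in Python rather
than SAS for brevity. The function name, the variable names, and the ZIP code
"30601" are hypothetical and only serve the illustration. It assumes a strict
leave-one-out reading: each record's own outcome is removed from both the
numerator and the denominator, so a non-defaulter in the 100-case example
gets 20/99.

```python
from collections import defaultdict

def leave_one_out_rates(zips, defaults):
    """For each record, return the default rate among the OTHER records in its ZIP code."""
    totals = defaultdict(int)   # number of records per ZIP code
    events = defaultdict(int)   # number of defaulters per ZIP code
    for z, d in zip(zips, defaults):
        totals[z] += 1
        events[z] += d

    rates = []
    for z, d in zip(zips, defaults):
        n_others = totals[z] - 1
        # A ZIP code with a single record has no other records,
        # so the rate is undefined there; None marks that case.
        rates.append((events[z] - d) / n_others if n_others > 0 else None)
    return rates

# The 100-case example: 20 defaulters and 80 non-defaulters in one
# (hypothetical) ZIP code.
zips = ["30601"] * 100
defaults = [1] * 20 + [0] * 80
rates = leave_one_out_rates(zips, defaults)

print(rates[0])    # a defaulter: 19/99
print(rates[-1])   # a non-defaulter: 20/99
```

Within a single ZIP code the resulting value takes only two levels, and which
one a record gets is determined by that record's own outcome.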
Is this not confounding? Do you recommend this approach? What else would you
recommend in this situation?
I would appreciate your suggestions and opinions.
Thanks
Nick Vaidya