Date: Thu, 14 May 1998 12:48:21 +0200
Reply-To: Paul Dickman <Paul.Dickman@ONKPAT.KI.SE>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Paul Dickman <Paul.Dickman@ONKPAT.KI.SE>
Subject: Re: Proc Genmod - Scale Parameter
Content-Type: text/plain; charset="us-ascii"
Martin,
The scale parameter is related to overdispersion. It's impossible to explain
the scale parameter without getting into some of the theory of GLMs. I'll
start with some theory and then try and explain the practical implications.
For a simple linear model (ordinary least squares) we have:
y_i = x_i'b + e_i
where e_i ~ N(0,sigma^2)
Var(y_i)=sigma^2
That is, we assume that the variance of the response is identical for all
combinations of the covariates and equal to sigma^2, which is estimated from
the data.
For a generalised linear model (GLM) we have:
g(u_i)=x_i'b, where u_i=E(y_i) and g is the link function.
and Var(y_i)=phi V(u_i)
That is, the variance of y_i is equal to some constant phi (the scale
parameter) times a function of the expectation of y_i.
If we specify a GLM with a normal error structure and identity link (that
is, a simple linear model), V(u_i) is set to one and Var(y_i)=phi is constant
for all values of u_i. In this case phi is sigma^2, which is estimated from
the data (the deviance divided by its df gives the usual estimate).
If we are fitting a Poisson regression model (which I assume you are doing),
we have the variance equal to the mean. That is Var(y_i)=E(y_i)=u_i. By
default, the model is fitted in GENMOD under the assumption that the data
were generated by a Poisson process, that is, Var(y_i)=u_i and the scale
parameter (phi) is set equal to one. The estimate of the scale parameter is
therefore reported as 1.000 and a note written in the output file that the
scale parameter was fixed.
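As a small illustration of the relation Var(y_i)=phi V(u_i), here is a
minimal Python sketch (not SAS code). The function name glm_variance and the
table of variance functions are my own illustrative choices; nothing here is
part of GENMOD.

```python
# Illustrative sketch: Var(y_i) = phi * V(u_i) for two error structures.
# All names are hypothetical and are not part of SAS or GENMOD.

VARIANCE_FUNCTIONS = {
    "normal": lambda mu: 1.0,   # V(u) = 1, so Var(y) = phi = sigma^2
    "poisson": lambda mu: mu,   # V(u) = u, so Var(y) = phi * u
}

def glm_variance(mu, phi=1.0, family="poisson"):
    """Return phi * V(mu) for the chosen error structure."""
    return phi * VARIANCE_FUNCTIONS[family](mu)

# Normal: variance stays at phi whatever the mean is.
# Poisson with phi fixed at 1: variance equals the mean.
for mu in (0.5, 2.0, 10.0):
    print(mu, glm_variance(mu, phi=4.0, family="normal"), glm_variance(mu))
```

With phi left at its default of 1 the Poisson variance simply tracks the
mean, which is the assumption GENMOD makes when it reports the scale as fixed.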
To reiterate, in the simple linear model, the y_i's are assumed to have
constant variance. The variance of the y_i's is estimated from the data and
can take any value greater than zero. In the Poisson model, the y_i's do not
have constant variance. The variance of y_i is assumed to be equal to the
expectation of y_i, where the expectation of y_i is estimated from the data.
For the Poisson model, the covariance matrix, and hence the standard errors
of the parameter estimates, are estimated under the assumption that the
Poisson model is appropriate. Occasionally we may observe more variation in
the response than what is expected by the Poisson assumption. This is called
overdispersion and means that the estimates of the standard errors of the
parameters will not be correct. Overdispersion typically occurs when the
observations are correlated. Underdispersion (less variation than expected)
is also possible, although not as common.
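A quick, informal way to see what "more variation than expected" means is to
compare the sample variance of some counts with their sample mean, since the
Poisson assumption says the two should be about equal. The sketch below uses
made-up counts and ignores covariates entirely, so treat it only as an
illustration of the idea, not as a model-based test.

```python
from statistics import mean, variance

# Made-up counts, purely for illustration.
counts = [0, 1, 0, 7, 2, 12, 0, 3, 9, 1]

sample_mean = mean(counts)
sample_var = variance(counts)          # sample variance (n - 1 denominator)
var_to_mean = sample_var / sample_mean

# Under a Poisson assumption we would expect this ratio to be near 1.
print(f"mean={sample_mean}, variance={sample_var}, ratio={var_to_mean:.2f}")
```

A ratio well above 1 suggests overdispersion and a ratio well below 1
suggests underdispersion, always assuming the model for the mean is adequate.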
You can identify possible overdispersion by dividing the deviance by its
degrees of freedom (this ratio is an estimate of the dispersion parameter,
phi). If the deviance is approximately equal to the df (ratio close to 1)
then there is no evidence of overdispersion. Note that a ratio greater than
one does not necessarily mean overdispersion. It can also indicate other
problems,
such as an incorrectly specified model or outliers in your data. An
incorrectly specified model can be due to an incorrectly specified
functional form (an additive rather than a multiplicative model may be
appropriate) or, more likely, that important explanatory variables (or
interactions) are missing from your model.
In most cases, lack of fit (identified by deviance > df) is due to missing
explanatory variables (or interactions) from the model.
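To make the deviance/df check concrete, here is a minimal Python sketch of
the Poisson deviance, 2 * sum(y*log(y/u) - (y - u)), divided by its degrees
of freedom. The counts are made up, the fitted values come from an
intercept-only model (every fitted value is the overall mean), and the
function names are my own; in practice GENMOD reports these quantities for
you.

```python
import math

def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), taking 0*log(0) = 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

def dispersion_ratio(observed, fitted, n_params):
    """Deviance divided by its degrees of freedom (n minus fitted parameters)."""
    df = len(observed) - n_params
    return poisson_deviance(observed, fitted) / df

# Made-up counts; intercept-only fit, so each fitted value is the mean.
counts = [0, 1, 0, 7, 2, 12, 0, 3, 9, 1]
fitted = [sum(counts) / len(counts)] * len(counts)

ratio = dispersion_ratio(counts, fitted, n_params=1)
print(f"deviance/df = {ratio:.2f}")
```

A large ratio here could just as well mean that the intercept-only model is
missing important covariates, which is exactly the caveat made above.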
If you believe you have a correctly specified model, and the deviance is
greater than the df, then you conclude that your data are overdispersed. You
should be able to identify a reason why your data are overdispersed. If you
don't correct for the overdispersion, then inference will be biased due to
underestimated standard errors.
There are a variety of ways of correcting for the overdispersion, one of the
simplest being to scale the covariance matrix by a constant. That is,
instead of Var(y_i)=u_i, we assume Var(y_i)=phi * u_i, where phi is greater
than 1 for an overdispersed model. phi can be estimated by the deviance
divided by its df, which can be done in GENMOD by specifying DSCALE as an
option on the MODEL statement. Note that the SCALE value GENMOD then reports
is the square root of this ratio, and the standard errors are multiplied by it.
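The arithmetic of this adjustment is simple enough to sketch in a few lines
of Python. The deviance, df and standard errors below are hypothetical
numbers, and the function names are mine; the sketch only mirrors the idea
that the covariance matrix is multiplied by deviance/df, so each standard
error is multiplied by the square root of that ratio. The parameter
estimates themselves are unchanged.

```python
import math

def deviance_scale(deviance, df):
    """Square root of deviance/df, the factor applied to standard errors."""
    return math.sqrt(deviance / df)

def rescale_standard_errors(std_errors, deviance, df):
    """Multiply model-based standard errors by sqrt(deviance/df)."""
    scale = deviance_scale(deviance, df)
    return [se * scale for se in std_errors]

# Hypothetical values from a fitted Poisson model.
model_based_se = [0.10, 0.20]
adjusted_se = rescale_standard_errors(model_based_se, deviance=45.0, df=20)

print(deviance_scale(45.0, 20))   # sqrt(45/20) = sqrt(2.25)
print(adjusted_se)
```

Because the factor exceeds 1 for overdispersed data, confidence intervals
widen and test statistics shrink after the adjustment.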
Any good book on GLMs will include a discussion on overdispersion and how to
identify and adjust for it. See
<http://www.maths.uq.edu.au/~gks/research/glm/books.html> for a list of
references. David Collett gives a very good general (non-mathematical)
overview of overdispersion and methods of adjusting for it in the
case of binomial outcomes in his book 'Modelling Binary Data' (Chapman and
Hall 1993).
Paul Dickman
---
Paul Dickman, Paul.Dickman@onkpat.ki.se
Cancer Epidemiology Unit, Radiumhemmet,
Karolinska Hospital, 171 76 Stockholm, Sweden
Ph: +46 8 5177 5375 Fax: +46 8 326 113
At 10.29 1998-05-14 +0200, you wrote:
>Hi SAS-Lers,
>
>We are using this proc to build a multiplicative model based on a set of
>variables <parameters> (e.g. gender, marital status etc.).
>In the output of PROC GENMOD there is an intercept and a set of relative
>factors for each value of each parameter. The intercept represents the
>observed frequency and each of the relative factors is used to adjust
>this frequency according to the specific combination of parameter values
>in a particular observation.
>i.e. Frequency = Intercept * Relative Factor for Gender * Relative
>Factor for Marital Status * ......
>
>In the output, however, there is a 'parameter' called SCALE which is
>automatically output (in the same way as the intercept is output
>automatically).
>
>What does this scale value represent and how should we allow for it in
>our multiplicative model? Should we allow for it at all?
>
>Any help would be greatly appreciated
>
>Martin
>
>Martint@hollard.co.za
>
>