| Date: | Tue, 11 Aug 2009 14:36:48 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: Low concordance in weighted sample |
| In-Reply-To: | <200908111647.n7BAo2C0013389@mailgw.cc.uga.edu> |
| Content-Type: | text/plain; charset="us-ascii" |
Hema:
You ask
>> >
1. Why is the concordance low in weighted sample? Since concordance is the number of pairs where the predicted probability of positive response is greater than predicted probability of no response
2. Is there a way by which we can get the concordance of the sample/universe using weights?
<<<
As a rule, statisticians use a standardized statistic, such as the c statistic on a 0 to 1 scale, to estimate concordance. Instead of sampling non-events and weighting, I've summarized all observations and weighted by group counts (n). For that specification, SAS PROC LOGISTIC output includes a correct c statistic and other correct concordance statistics as well. This example comes from my response to a similar question earlier this year:
>>>>
-----Original Message-----
From: Sigurd Hermansen
Sent: Wednesday, May 20, 2009 9:29 PM
To: 'Liang Xie'; 'SAS-L@LISTSERV.UGA.EDU'
Subject: RE: Low Event Rate Predictive Modeling
<snip>
A typical predictive model would represent in the abstract the process that generates the event that you are trying to predict. It may be a function of predictors with errors. A statistical model represents both the function of predictors and the distribution of errors. For example, a logistic regression model postulates a linear function of predictors that predicts, more or less well, a logit of an event (outcome). In other words, a logistic regression model maps a function of predictors to a logistic distribution (with values between 0 and 1) that represents the probability that an event of interest will occur given the values of the predictors. The predictions differ from observed outcomes in a sample by an error rate.
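To make the logit-to-probability mapping concrete, here is a minimal sketch that is not part of the original program. The intercept b0 and slope b1 are hypothetical values chosen only for illustration, not estimates from any data. The step writes the linear predictor and the implied probability to the log for a few body weights.

```sas
/* Illustration only: b0 and b1 are assumed values, not fitted estimates. */
data _null_;
  b0 = -4;      /* hypothetical intercept */
  b1 = 0.02;    /* hypothetical slope per unit of bodyweight */
  do bodyweight = 80 to 180 by 50;
    logit = b0 + b1*bodyweight;     /* linear function of the predictor */
    p = 1 / (1 + exp(-logit));      /* inverse logit, always between 0 and 1 */
    put bodyweight= logit= p=;      /* write the three values to the log */
  end;
run;
```

Whatever values the linear function takes, the inverse logit keeps p strictly between 0 and 1, which is what lets it stand for a probability.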
Suppose, for instance, that in the program below we control the error rate (err) initially at 2% of a standard normal (mean=0, variance=1) distribution. For a sample size of 100, we compute a beta estimate that is statistically insignificant at the <=5% risk level (Pr > X**2 >= 0.10), with an estimated OR confidence interval of {0.977,1.289}, but a strong AUC estimate of 0.904 (as reflected also by the graph of the ROC curve):
____________________________________________________
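/* Simulation settings: slope of the predictor function, error scale, sample size */
%let b1=%str(0.02);
%let err=%str(0.02);
%let sampleSize=100;
ods trace on;
/* Simulate a binary outcome driven by bodyweight plus a normal error */
data test;
  do i=1 to &sampleSize ;
    bodyweight=80 + ranuni(321457)*100;
    fx=(&b1*bodyweight);
    efx=exp(fx);
    cefx=1-efx;
    L=-(1+exp(fx))** -1 + normal(55469)* &err ;
    if L>0 then disease=1;
    else disease=0;
    output;
  end;
run;
/* Summarize: one row per (disease, rounded bodyweight) cell with its count n */
proc sql;
  create table test as
  select distinct disease,round(bodyweight,1.) as bodyweight,count(*) as n
  from test group by disease,calculated bodyweight
  ;
quit;
/* Fit the model on the summarized data, with n as the frequency of each row */
proc logistic data=test outmodel=BWModel descending;
  model disease = bodyweight/ctable pprob = (0 to 1 by .10) outroc=roc ;
  freq n;
  ods output ParameterEstimates=ParameterEstimates;
run;
/* Score the summarized data with the stored model */
proc logistic inmodel=BWModel;
  score data=test out=scores;
run;
/* Plot the ROC curve: sensitivity against 1 - specificity */
symbol1 i=join v=none c=blue;
proc gplot data=roc;
  plot _sensit_*_1mspec_=1/vaxis=0 to 1 by .1 ;
run;
ods trace off;
quit;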
______________________________________________________
The small number of observed events in the sample decreases the power of model statistics and directly affects the fit statistics. Too accurate a predictive function reduces the number of observed events of interest below a critical number. Adding errors to data actually improves model fit statistics, though not the c statistic estimate of AUC. Doubling the value of the error rate to 4% (%let err=%str(0.04);) paradoxically improves the fit statistics while decreasing the value of the c statistic. Predictions from small samples of a large set of data may fall into this trap.
>>>>
I would recommend the summarization and weighting method of data reduction over the sampling and weighting method, except when using repeated resampling to reduce out of sample prediction errors. For a SAS Macro that computes c statistics on weighted data, see my Predictive Modelling, Part I paper from SESUG 2008. It includes a corrected version of a SAS Macro that also appears, with a sum instead of the correct product of counts, in my SGF 2008 paper on Predictive Modelling.
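The weighted concordance calculation can be sketched in PROC SQL. This is not the macro from the SESUG paper, only an illustration under stated assumptions: the data set scored, its variable phat (a predicted event probability), and the four data rows are made up, and n is the number of observations each row represents. Each event row is paired with each non-event row, the pairing counts as the product of the two n's, and ties are scored one half.

```sas
/* Hypothetical grouped data: one row per (outcome, predicted probability)
   cell; n is the number of observations the row stands for. */
data scored;
  input disease phat n;
  datalines;
1 0.80 3
1 0.40 2
0 0.40 4
0 0.10 6
;
run;

/* Weighted c: share of event/non-event pairs in which the event has the
   higher predicted probability. Each row pairing contributes e.n*ne.n
   pairs (a product of counts); tied pairs contribute one half. */
proc sql;
  select sum(e.n * ne.n * ((e.phat > ne.phat) + 0.5 * (e.phat = ne.phat)))
         / sum(e.n * ne.n) as c_weighted
  from scored(where=(disease=1)) as e,
       scored(where=(disease=0)) as ne;
quit;
```

For these four rows the numerator is 12 + 18 + 4 + 12 = 46 and the denominator is 5*10 = 50, so c_weighted = 0.92. On real data the result may differ slightly from the PROC LOGISTIC c statistic, because, as far as I know, PROC LOGISTIC groups predicted probabilities into narrow bins before counting pairs.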
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of SUBSCRIBE SAS-L Hema
Sent: Tuesday, August 11, 2009 12:47 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Low concordance in weighted sample
Hi All,
I am working on Logistic Model. Our universe is of millions and the binary
positive response for a particular product usually in our universe is very
low. For example we have 500 positive (1's) responders only. Hence our
sampling is such that
- The sample should contain all the responders
- Responders should be 10 or 20% of the sample (which ever is possible)
- For selection of non responders, we sample the remaining number of
observations from the non responders
Hence sample=total responders+sample of non responders.
Because we sample in this way, we use weights. The problem occurs at the
logistic stage. The number of pairs (for concordance) that the logistic
output shows is of sample data, whereas the estimates are of the weighted
data i.e. the universe.
My questions are :-
1. Why is the concordance low in weighted sample? Since concordance is the
number of pairs where the predicted probability of positive response is
greater than predicted probability of no response
2. Is there a way by which we can get the concordance of the
sample/universe using weights?
Help Appreciated.
Hema