LISTSERV at the University of Georgia
Date:         Wed, 26 Mar 2008 16:15:17 -0400
Reply-To:     "Luo, Peter" <Peter.Luo@DRAFTFCB.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Luo, Peter" <Peter.Luo@DRAFTFCB.COM>
Subject:      Re: PROC LOGISTIC MODEL--Standardize vars?
Comments: To: Tom White <tw2@MAIL.COM>
In-Reply-To:  A<20080326185211.C564C478077@ws1-5.us4.outblaze.com>
Content-Type: text/plain; charset="us-ascii"

When you say you have 5M records, does M mean a thousand or a million (people do use M for thousand)? If it is a thousand, then forget over-sampling; otherwise you have 50,000 bad cases, so pull another 50,000 good cases and work on a sample of 100,000 (the extra time you would spend just 'running' a model on 5 million records rather than on 100,000 records could well be spent elsewhere).
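The arithmetic behind that suggestion is 5,000,000 x 1% = 50,000 bad cases, matched with 50,000 randomly drawn good cases. A minimal sketch of such a draw follows, in Python rather than SAS. It assumes the claims fit in a list of dictionaries with a 0/1 field named `fraud`; the function name, field name, and toy data are hypothetical and not from the thread.

```python
import random

def downsample_zeros(records, n_zeros, seed=0):
    """Keep every record with fraud == 1 plus a simple random sample of the 0s."""
    ones = [r for r in records if r["fraud"] == 1]
    zeros = [r for r in records if r["fraud"] == 0]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    kept_zeros = rng.sample(zeros, min(n_zeros, len(zeros)))
    return ones + kept_zeros

# Toy stand-in for the real data: 10,000 records with a 1% fraud rate.
records = [{"id": i, "fraud": 1 if i % 100 == 0 else 0} for i in range(10_000)]

# Match the number of 0s to the number of 1s, as suggested above.
n_ones = sum(r["fraud"] for r in records)
sample = downsample_zeros(records, n_zeros=n_ones)
```

The sampling fraction used for the 0s is worth recording, since it is needed later if the predicted probabilities are to be mapped back to the population scale.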

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Tom White
Sent: Wednesday, March 26, 2008 2:52 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: PROC LOGISTIC MODEL--Standardize vars?

Peter Luo writes below:

"I won't build my model on data consisting of 1% 1s and 99% 0s. Draw a sample with all the 1s and a reasonable number of 0s."

Peter, if I do that, which is what I was proposing, then, as you say, the base probability will change. And that's the only thing that will change?

So, then, if I do over-sampling by selecting all of my 1s and some number of 0s so that my TRAINING data set will consist, say, of 10% 1s and 90% 0s, is this an appropriate ratio to catch fraudulent claims?

As you say, these claims will still rank-order correctly, so I won't have to adjust the intercept (base probability) for my altered sample of 10% 1s and 90% 0s. (Remember, in real life, it is only 1% 1s and 99% 0s.)

So, then, Peter, what might be a ratio to work with?

I will choose all of my 1s, and then HOW MANY 0s should I choose for the problem I am trying to solve?

Thank you.

T

----- Original Message -----
From: "Luo, Peter"
To: "Tom White", SAS-L@LISTSERV.UGA.EDU
Subject: RE: Re: PROC LOGISTIC MODEL--Standardize vars?
Date: Wed, 26 Mar 2008 13:56:29 -0400

I'm not sure that an IV's range will affect the regression estimation. It's not like cluster analysis, where the variable with the larger scale is unfairly given more 'weight' in the 'distance' because of the way the distance is constructed. In regression of any kind, the algorithm is looking for the slope of the DV against the IV. A simple dummy IV may show steeper differentiation on the DV between its two values than an IV with more values and a larger range.
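The point about slopes can be checked numerically. The sketch below fits a one-predictor logistic model by Newton-Raphson in plain Python, once on x in [0, 1] and once on 1000*x. The toy data and the function name `fit_logistic` are illustrative assumptions, not anything from the thread, and the fit is ordinary unpenalized maximum likelihood.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit p = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        # Score vector (g0, g1) and information matrix entries (h00, h01, h11).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system by explicit inversion.
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Toy data with overlapping classes, so the maximum-likelihood fit is finite.
xs = [i / 10 for i in range(11)]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]

a1, b1 = fit_logistic(xs, ys)                     # x on a 0-1 scale
a2, b2 = fit_logistic([1000 * x for x in xs], ys) # same x on a 0-1000 scale
```

Under these assumptions the intercepts agree and the slope on the 0-1000 scale is the 0-1 slope divided by 1000, so a + b*x, and with it every predicted probability, is the same in both fits. Penalized or distance-based methods are a different matter and are not covered by this sketch.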

Weighting will only affect the intercept (the base probability), not the coefficients of the IVs. So yes, if you don't adjust, you still get the right rank-order.
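If calibrated probabilities are wanted as well as the ranking, one standard adjustment shifts the intercept by the log of the ratio between the sample odds and the population odds of a 1. The sketch below assumes the model is correctly specified and that cases were selected on the outcome only. The function names are hypothetical; `pop_rate` is the true share of 1s (about 1% here) and `sample_rate` is the share of 1s in the modeling sample.

```python
import math

def intercept_offset(pop_rate, sample_rate):
    """Amount to subtract from the intercept fitted on the oversampled data."""
    sample_odds = sample_rate / (1.0 - sample_rate)
    pop_odds = pop_rate / (1.0 - pop_rate)
    return math.log(sample_odds / pop_odds)

def adjust_probability(p_sample, pop_rate, sample_rate):
    """Map a probability scored on the oversampled scale to the population scale."""
    num = p_sample * pop_rate / sample_rate
    den = num + (1.0 - p_sample) * (1.0 - pop_rate) / (1.0 - sample_rate)
    return num / den

# Scores from a model built on a 50/50 sample, mapped back to a 1% base rate.
scores = [0.90, 0.20, 0.60, 0.05]
adjusted = [adjust_probability(p, pop_rate=0.01, sample_rate=0.50) for p in scores]
offset = intercept_offset(pop_rate=0.01, sample_rate=0.50)  # log(99)
```

The mapping is strictly increasing in the score, which is why the rank-order survives whether or not the adjustment is applied.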

I won't build my model on data consisting of 1% 1s and 99% 0s. Draw a sample with all the 1s and a reasonable number of 0s.

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Tom White
Sent: Wednesday, March 26, 2008 12:20 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: PROC LOGISTIC MODEL--Standardize vars?

Thank you Sig, Peter, and Chang.

Just to make sure I am clear in my mind as to what the three of you are telling me.

(1) Since I will fit a LOGISTIC model, the DV is, of course, 0s and 1s (0=No fraud, 1=Fraud). We will use historical data to build the model and use it to score new claims. So, as Peter said, we are doing prediction.

(2) My available historical dataset has about 5 years' worth of claims, which boils down to about 30K unique (i.e. no duplicates) data points (i.e. doctors who submit claims to us). This, in turn, translates to around 5M or so claims. Since I am trying to predict the probability of a particular claim coming from a doctor as fraudulent or not, I will work with the 5M points to develop the logistic model.

(3) The DV (in the 5M or so records) has about 1% 1s and the remaining 99% 0s.

(4) Back to my original question: Let's not assume what the IVs represent. The majority of them represent percentages, i.e. numbers between 0 and 1 (inclusive). Other IVs represent various counts, and their values could range from 0 up to whatever integer (say, 23, or 45, or 7000, etc.).

My question is again: If I have a bunch of IVs to consider for model inclusion, do I have to worry about their ranges? In other words, do I need to worry about their sizes (scaling them, transforming them, standardizing them, etc.)? The purpose of this is to find out which ones are significant to include in the model.

As Peter said, working with the original IVs will give me one model. Transforming the IVs in some way will give me another model. So, which model do I want? Obviously I want the model which will catch most fraudulent claims! So, should I work with the original IVs or transform them somehow?

For example, if I use two IVs, VAR1 (values from 0 to 1) and VAR2 (values from 0 to 1000), do I need to worry that VAR2, which has much bigger values than VAR1, will somehow overtake VAR1 when it comes to parameter estimation, i.e. when it comes to choosing the best model?

If I shouldn't worry about the value ranges of each of the candidate IVs, then I will just go ahead and use them in their original form (no transformation, standardizing, etc.) to see which among them are the best candidates for model inclusion. The variable selection will be done as Sig suggests using Peter's paper etc.

Your comments are appreciated so I become clear in my mind.

One other question I'd like to ask now since I gave you the background.

Peter(?) and possibly others(?) have said that if you use many obs (remember, I have 5M obs to develop the model-- about 1% fraud rate), then any minute (insignificant?) IVs will show up as significant.
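The effect described here can be illustrated with a two-proportion z-test, used as a simple stand-in for the Wald tests a logistic fit reports. The sketch compares the same tiny difference in fraud rates (1.05% vs 1.00%) at two sample sizes; the rates, group sizes, and function name are made up for illustration.

```python
import math

def two_proportion_z(p1, p2, n_per_group):
    """z statistic for the difference between two proportions, equal group sizes."""
    p_bar = (p1 + p2) / 2.0  # pooled proportion when the groups are the same size
    se = math.sqrt(p_bar * (1.0 - p_bar) * 2.0 / n_per_group)
    return (p1 - p2) / se

# Same difference in rates, very different amounts of data.
z_small = two_proportion_z(0.0105, 0.0100, 5_000)      # 10,000 obs in total
z_large = two_proportion_z(0.0105, 0.0100, 2_500_000)  # 5,000,000 obs in total
```

With 10,000 observations the difference is nowhere near the usual 1.96 cutoff; with 5 million it clears it easily, because the statistic grows with the square root of the sample size while the effect itself stays just as small. That is a reason to judge candidate IVs by how much they improve prediction on held-out data rather than by p-values alone.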

I was thinking of using all 5M obs to develop the model (I have another year's worth put aside for validation, etc.). Is this appropriate? David has said the more the better! That's why I am thinking of using all of them. (The data obs go back to around year 2000 or so.) Or am I better off randomly selecting a smaller number, say, 250K obs out of the 5M (keeping the 1%--99% ratio the same), to work with? I think David would say use all 5M of them?

The other issue is my small 1% fraud rate.

Do I need to do some kind of weighting here to give more "strength" to the # of obs with fraud status=1?

For example, what if I (randomly) select a data set where the 1s represent, say, 10% and the 0s represent 90%? (Or the 1s represent 50% and the 0s represent 50%, etc.)

Is this appropriate and if yes, how would I then need to proceed with the model development?

Maybe I just develop the model with the oversampled dataset and at the end I somehow adjust the probabilities? Do I even need to adjust the probabilities? Even if I don't adjust the probabilities, I think the most probable fraudulent claims will still rank-order correctly. As long as the most probable fraudulent claims show up at the top, that's all I want. We will isolate those claims for further processing.

Thank you for your thoughts.

Tom

----- Original Message -----
From: "Sigurd Hermansen"
To: "Tom White", sas-l@listserv.uga.edu
Subject: RE: PROC LOGISTIC MODEL--Standardize vars?
Date: Tue, 25 Mar 2008 18:39:46 -0400

Tom: Chang has astutely recommended that you consider transformations of variables before you standardize for the sake of selecting important variables. Regression models handle any monotonic linear relation between a DV or transformation of a DV and an IV or transformation of an IV, regardless of scale. It's lack of linearity of the relation between IV and DV that introduces poor fit and bias, not the distribution of the IVs. (Re-scaling may do more to improve estimations of parameter variances.)

I'd take a look at logistic model estimation in PROC GENMOD before worrying too much about scaling or standardizing IVs. Also, David Cassell has tentatively recommended PROC GLMSELECT for variable selection and generalized linear model specification, perhaps combined with classification trees. Peter Flom's NESUG paper with David makes a good starting point, as does David's SGF2007 paper on bootstrapping and resampling methods in SAS. Check the SAS-L Archives for several extended threads on logistic model specification.

Model selection should be only part of a disciplined exploratory modelling process that precedes selection of a final model from a small number of candidates. Automated model selection methods favor overfitting of models to a sample of data. That means models 'predict' well in a sample with known outcomes, but not very well when applied to other samples of data. Beware of that trap. I have many scars on my legs from stepping into that trap as it lay hidden in various guises.

S

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Tom White
Sent: Tuesday, March 25, 2008 5:21 PM
To: sas-l@listserv.uga.edu
Subject: PROC LOGISTIC MODEL--Standardize vars?

Hello SAS-L,

I have a general modeling question. I am trying to build fraud detection models (healthcare, credit card, etc.). So, that's the context. I will use PROC LOGISTIC. I have about 350 variables to consider. The majority of them are numeric. Some vars have values restricted between 0 and 1 (they are percentages 0% to 100%, but in the data set they are stored as 0 to 1). Other vars represent $$$ amts and they are >1. Some have values like $34,000, $156,000, etc. (wide range of values).

That's my question--a question of SCALING or standardizing. When I am looking to find out which vars are important for inclusion in the PROC LOGISTIC MODEL (I am not using stepwise methods, as I learned long ago from David and Peter), how do the actual variable values affect inclusion or exclusion into the model? I mean, if I include one variable whose values range from 0 to 1 and another variable whose values range from $5,000 to $1,000,000, then I think something doesn't look right here in terms of sizes.

Therefore, what is the appropriate protocol to follow? Should I standardize by finding the STD of each variable and work with standardized vars instead? If not, what then? Just work with vars in their original form?

Thank you.

T


This message is the property of Draftfcb and contains information which may be privileged or confidential. It is meant only for the intended recipients and/or their authorized agents. If you believe you have received this message in error, please notify us immediately by return e-mail and destroy any printed or electronic copies of the message. Any unauthorized use, dissemination, disclosure, or copying of this message or the information contained in it, is strictly prohibited and may be unlawful. Thank you for your cooperation. (A)
