LISTSERV at the University of Georgia
Date:   Thu, 9 Feb 2006 16:27:50 +1100
Reply-To:   paulandpen@optusnet.com.au
Sender:   "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:   Paul Dickson <paulandpen@optusnet.com.au>
Subject:   Re: R: Weight role in using MVA analysis
Comments:   To: Hector Maletta <hmaletta@fibertel.com.ar>
Content-Type:   text/plain

Hi Hector

Thanks for your comments as always. I agree with your first comment completely. I was lax in restricting the use of weighting to non-random data; its application is much broader than that and of course extends to randomly sampled data.

To address one of your points: in theory the ecological validity of regression results does rest on the assumption of random sampling, but in your experience, how many times has regression been run on non-random data and submitted for peer-reviewed publication with little or no fanfare by colleagues? Many models developed in psychology and business via regression (certainly before SEM, and still in the case of SEM) fall into that vein, with uni students being the cohort they are developed from. Not ideal, but very common in the real world.

My point about problems with a weighted imputation was more to do with the biases that can emerge from imputation via regression (not very well supported in the imputation literature when compared with MI anyway) if the data (eg non-response) are not missing at random or missing completely at random. This is different from the sampling process and the degree of randomness of the sampling process, as you are no doubt aware.

Here is the scenario, the double whammy I refer to, discussed in a different way. Assume random sampling has taken place and five cohorts from different geographical locations are sampled with identical levels of randomness. You are interested in identifying drivers of income disparity for the total area, not for each geographical cohort individually. One of the five geographical locations (cohort 1) comprises the lowest income earners. Based on incidence, this geographical area represents 20% of the total population, but in your current sample drawn from the five geographical areas it makes up only 10% of the sample, because you did not have geographical quotas specified.

A weighting process is designed to upweight the contribution of cohort 1 to reflect its real incidence in the total population (eg from 10% to 20%), and this weighting would be incorporated during the imputation process (is this assumption of mine correct?). In other words, upweighting this sample at the time of the imputation should prevent it from being under-represented in the imputation process. If you use unweighted data in your imputation, assuming data are missing at random or missing completely at random, then the other 4 cohorts will no doubt have a biased influence on the income values derived from the imputation, particularly for low-income cohort 1, because it is under-represented at the sample level relative to what it should be at the total population level. These are solid grounds for weighting at the point of imputation to adjust the sample to more accurately reflect population income estimates, and to ensure that the imputation model more closely approximates the population breakdown, not the sample breakdown.
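
To make the arithmetic concrete, here is a minimal Python sketch of the upweighting. Only cohort 1's figures (20% of the population, 10% of the sample) come from the scenario above; the even split of the remaining shares across cohorts 2 to 5, and the names population_share, sample_share and poststrat_weights, are assumptions made for illustration.

```python
# Shares per cohort. Cohort 1's values come from the scenario; the even
# split across cohorts 2-5 is an ASSUMPTION for illustration only.
population_share = {1: 0.20, 2: 0.20, 3: 0.20, 4: 0.20, 5: 0.20}
sample_share = {1: 0.10, 2: 0.225, 3: 0.225, 4: 0.225, 5: 0.225}


def poststrat_weights(pop, samp):
    """Weight per cohort = population share / sample share."""
    return {cohort: pop[cohort] / samp[cohort] for cohort in pop}


weights = poststrat_weights(population_share, sample_share)

# Share of the weighted sample that each cohort carries after weighting.
weighted = {c: sample_share[c] * weights[c] for c in weights}
total = sum(weighted.values())
weighted_share = {c: w / total for c, w in weighted.items()}

for cohort in sorted(weights):
    print(cohort, round(weights[cohort], 3), round(weighted_share[cohort], 3))
```

Under these assumed shares cohort 1 gets a weight of 2.0 and each of the other cohorts a weight of roughly 0.89, so cohort 1 carries 20% of the weighted sample instead of 10%.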

Agreement to here. Here is where my viewpoint departs from yours.

If the pattern of missing data is biased initially (not missing at random) for cohort 1, then the problems inherent in that bias would be escalated by weighting in the imputation phase, when the biased sample is upweighted to reflect the population (upweighting the sample that is 10% to reflect 20% population incidence). Assume now that the low income cohort (cohort 1) has higher levels of missing data for income scores than the other geographical cohorts (missing data is not missing at random; at a global level it is due to geographical location). Assume also that their missingness is skewed so that the observed data give the impression that their incomes are higher as a cohort than they really are, because a large number of the highest income earners within this cohort respond with complete data and those making the least money do not respond to the same degree.

At the most basic level, imputing income data using only the information from this cohort will bias the imputed income results upwards: any model will over-estimate income at a global level (mean) and underestimate the SD. This is the case for regression, because regression uses this information during imputation. Therefore, parameters derived from this sample alone would tend to produce replacement values that ultimately over-estimate the income of cohort 1. The bias is caused by the response patterns of this cohort and can go in either direction depending on the pattern of missingness: it can underestimate or overestimate actual income for the cohort. (NB you are not privy to this information in advance when you impute unless you have past data to use to address the bias, and that is not assumed here.)
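
A toy illustration of the mean and SD effect, in Python. The income figures are made up, the non-response rule (the four lowest earners do not answer) is an assumption, and simple mean imputation is used as a stand-in for the regression imputation discussed here, so treat it as a sketch of the direction of the bias and nothing more.

```python
from statistics import mean, pstdev

# Hypothetical true incomes for cohort 1 (arbitrary units).
true_income = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# NMAR non-response (ASSUMED rule): earners below 18 leave income blank.
observed = [x if x >= 18 else None for x in true_income]

# Impute every blank with the mean of the respondents in this cohort only.
respondents = [x for x in observed if x is not None]
fill = mean(respondents)
imputed = [x if x is not None else fill for x in observed]

print("true mean:", mean(true_income), "imputed mean:", mean(imputed))
print("true SD:", round(pstdev(true_income), 2),
      "imputed SD:", round(pstdev(imputed), 2))
```

With these numbers the imputed mean comes out above the true mean and the imputed SD below the true SD, which is the pattern described above. Reversing the non-response rule (highest earners leave income blank) pushes the mean the other way.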

If you do not weight your data and use information from all five cohorts in estimating the missing data for cohort 1, the bias inherent in the missing data of cohort 1 should be less problematic than if you weighted the data, because the biased sample has not been upweighted and so influences the overall regression model less. If you do weight your data, then the parameters derived from this biased but upweighted sample have more of an impact, because they are now weighted upwards to reflect popn parameters. The model will in turn tend to replace missing values with over-estimates of the income of cohort 1, because the biased sample has a greater influence on the model due to your weighting.

Hector, I believe the weighting you propose in your response involves adjusting the weighting process to reflect actual income for a sample, but this requires prior knowledge of accurate income distributions, income ranges etc for each cohort, and this is why imputing on weighted samples with NMAR missingness is problematic. At least with MAR and MCAR you can assume that the 5 cohorts collectively are randomly making response errors and that the data remaining after the missing data are set aside reflect the real values of income. This is not the case with NMAR.

Weighting a biased sample to have a greater influence on a model causes greater problems than not weighting because it escalates the biasing (double whammy).

Regards Paul

> Hector Maletta <hmaletta@fibertel.com.ar> wrote:
>
> Paul,
> Just a small remark. You say "Weighting from my understanding is designed to adjust sample information (scores etc) that is collected non-randomly (eg with bias)". IMHO, all mainstream statistical analysis (in particular the statistical significance of sample results in such standard procedures as linear regression) rests on the assumption that the sample is a RANDOM sample. But a random sample may be drawn with different sampling ratios for different parts of the population, and weighting corrects for this. Properly applied, weighting corrects imbalances or disproportionalities built into the sample design, and is therefore ex ante. Sometimes a non-random sample (usually resulting in proportions not close to the population as regards gender, socioeconomic status, age, etc.) is corrected ex post through weighting, but this weighting, like the one applied to a random sample, does not add to or detract from the randomness or non-randomness of the sample.
> Regarding your income example, a regression applied to the subpopulation with valid income data would be, undoubtedly, biased if the missing incomes are disproportionately higher among people with high (or low) income. The reason is that the distribution of income in the subsample with valid income data would not coincide with the (unknown) distribution of income in the whole sample. Since you do not know the latter, you may infer this if the distribution of PREDICTORS in the subsample is significantly different from their distribution in the whole sample. If so, you may correct the original sample weights in the subsample to reflect also the distribution of predictors in the whole sample (for each combination of values of the predictors, multiply the original weights by the proportion of that combination in the whole sample, and divide by the proportion in the subsample).
> This would only be important when missing values are highly selective on the values of the predictors. Otherwise, i.e. when the disproportionality in non-declaring income is slight or non-existent, the original sample weights would usually do quite well.
> In any case, you must apply weights to yield unbiased (or less biased) estimates for MVA. If you figure it out properly, I do not see how your "double whammy" might arise.
> Hector
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of Paul Dickson
> Sent: Wednesday, February 08, 2006 7:58 PM
> To: SPSSX-L@LISTSERV.UGA.EDU
> Subject: Re: R: Weight role in using MVA analysis
>
> Rita
>
> As far as the weighting issue and its use in imputing missing data (MVA) is concerned, I have a couple of comments to make. Weighting from my understanding is designed to adjust sample information (scores etc) that is collected non-randomly (eg with bias) to more closely reflect population patterns/datascores/parameters that are caused by differences between the sample and the population. Missing data imputation (MVA) is designed to replace missing data values that are generally MAR, or MCAR, in other words missing values are not caused by differences in the characteristics of the sample (eg demographics). I believe that assuming missing data imputations such as MVA need to be weighted back to popn parameters is problematic, because it would be very unlikely that the pattern of bias inherent in the sample relative to the popn due to sampling bias is identical or necessarily close to the pattern of bias (if any) inherent in the missing data caused during the data collection. In fact weighting runs the risk of causing greater bias (say for example that lower earning males tended to leave out their income scores more than higher earning males due to social desirability bias). The information that the algorithm in MVA has to go on is a set of regression parameters generated from high income males that it will then use to estimate the remainder of scores in the male sample that have been left blank. It will tend to overestimate the scores (give higher scores for lower income earning males) due to biases in non-response in favour of higher income earning males. If you further weight data to reflect biases in income and upweight males more to reflect differences in earning in the popn, then you have over-weighted an already biased imputation of the original data, causing a double whammy.
>
> BTW
> IMHO Gary King's and Schafer's Multiple Imputation packages are far superior to spss MVA and free!!!!!, although I have not seen version 14 of SPSS because it is an expensive add on.
>
> http://gking.harvard.edu/preprints.shtml#smooth
>
> http://www.stat.psu.edu/~jls/misoftwa.html
>
> Regards Paul
>
> Rita Clivio <rclivio@tradelab.it> wrote:
>>
>> Yes Hector, I wrote too fast ...
>> I have weights less than one, not negative.
>>
>> In this specific case I can avoid weighting the data because the sample is quite well proportioned; otherwise I don't see any solution other than to multiply all of them (maybe by 10) or add 1 to all. Alas, you're right about statistical significance.
>>
>> I use MVA to replace some values which I will use in a cluster analysis, so I think that weight is important in the replacement process but less so in the cluster.
>>
>> I'm running the cluster both with MV (pairwise) and replacing them with the MVA module; I would find the same path (?), I have to decide which solution to adopt.
>>
>> Have you any idea?
>>
>> Many thanks
>>
>> Rita
>>
>> PS
>>
>> I subscribed to this list a long time ago, then I unsubscribed because of a change of job.
>>
>> Now I'm here again and it's a pleasure to find you all (I remember Hector, Raynald, Richard, ...) and have help from a "message in the bottle" :-)
>>
>> -----Original Message-----
>> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of Hector Maletta
>> Sent: Tuesday, 7 February 2006 20:36
>> To: SPSSX-L@LISTSERV.UGA.EDU
>> Subject: Re: Weight role in using MVA analysis
>>
>> You write about weights "less than zero". I assume you mean "less than one". Weights below zero make no sense, and are considered missing by SPSS. If you actually have any negative weights, revise the way they were computed. Weights cannot be negative, and also zero weights mean the case is excluded.
>> Now assuming the weights were not negative, please notice that MVA uses regression for imputation, and for regression and other such procedures weighting is essential to obtain unbiased results from disproportionate samples. Increasing the scale of weights in a uniform manner (e.g. multiplying all of them by 100) would affect the statistical significance SPSS assigns to the results, since probability and standard errors in SPSS are based on total WEIGHTED cases, but otherwise would yield the same results as with your original weights. In your particular case I think significance levels are not particularly important, but beware MVA would use regression estimates it would otherwise consider non significant.
>> Hector
>>
>> -----Original Message-----
>> From: Rita Clivio [mailto:rclivio@tradelab.it]
>> Sent: Tuesday, February 07, 2006 3:39 PM
>> To: Hector Maletta; SPSSX-L@LISTSERV.UGA.EDU
>> Subject: R: Weight role in using MVA analysis
>>
>> Thanks Hector
>> I think you're right.
>>
>> The number of cases that I have when weighting the dbase is exactly the number of cases with a weight less than zero.
>>
>> And so ... do you think it's useful to weight the cases before conducting MVA?
>> If so, could I use some trick (i.e. weight * 100) in order to keep the original proportion in the dbase?
>>
>> Thanks again
>>
>> Rita
>>
>> -----Original Message-----
>> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of Hector Maletta
>> Sent: Tuesday, 7 February 2006 19:15
>> To: SPSSX-L@LISTSERV.UGA.EDU
>> Subject: Re: Weight role in using MVA analysis
>>
>> One possibility is that some of your weights (40%??) are zero or missing values. Cases with zero or missing weights are not "seen" as cases by SPSS. Zero weights may arise from non-zero fractional weights being rounded down to zero.
>> Hector
>>
>> -----Original Message-----
>> From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of Rita Clivio
>> Sent: Tuesday, February 07, 2006 2:56 PM
>> To: SPSSX-L@LISTSERV.UGA.EDU
>> Subject: Weight role in using MVA analysis
>>
>> Hi
>>
>> I have a question for any kind soul ... :-)
>>
>> Why, if I conduct MVA on a weighted database, do I find that I have replaced only 60% of the cases (i.e. I have a new file with replaced values that is 60% of all cases)?
>>
>> If I don't weight, conducting MVA I have a new file with all values of the dbase.
>>
>> Weight is assigned to all cases in dbase.
>>
>> Have you any idea?
>>
>> Thanks in advance for your help ... and time.
>>
>> Rita

