Date: Wed, 28 Sep 2005 10:32:19 -0400
Reply-To: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>
Subject: data screening help
Content-Type: text/plain; charset=ISO-8859-1
Hi all,
I have a number of questions relating to data screening (i.e., outlier,
normality, linearity checks) that I am hoping people can help me out with,
as I have literally exhausted all my resources. Some of the questions are
practical, others technical. I realize that there are quite a few Qs here,
but simple “yes” / “no” responses (where possible) will be more than
enough! Or if people can only answer one or some of the Qs, that will be just
as helpful. I’d be really, really grateful for any help at all.
1) Practical question – I’ve been trying to figure out how to make my
boxplots so that I can actually see the case/ID numbers next to the
outliers. It’s OK when I have one outlier, but when I have a bunch of them
they end up on top of each other and I can’t see what ID/Case no. they are.
Is there another way SPSS can show me the case numbers of the outliers, or
a way I can visually inspect the case numbers on the boxplots? I’ve tried
blowing the boxplot up to full screen size, but even this doesn’t help.
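One possible way around overlapping labels is to list the flagged cases in a table instead of reading them off the plot. The sketch below is a minimal Python illustration, not SPSS syntax, of the conventional boxplot rule (values more than 1.5 times the interquartile range beyond the quartiles). The function name `boxplot_outliers` and the sample IDs and scores are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS uses.

```python
from statistics import quantiles

def boxplot_outliers(cases, k=1.5):
    """Return (case_id, value) pairs lying more than k*IQR beyond the quartiles.

    `cases` maps a case ID to its score; k=1.5 is the conventional boxplot fence.
    """
    values = sorted(cases.values())
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [(cid, v) for cid, v in cases.items() if v < lower or v > upper]
    # Sort by value so low and high outliers are easy to tell apart
    return sorted(flagged, key=lambda pair: pair[1])

# Hypothetical data: case ID -> score
scores = {101: 12, 102: 14, 103: 13, 104: 15, 105: 14,
          106: 13, 107: 40, 108: 41, 109: 1}
for case_id, value in boxplot_outliers(scores):
    print(case_id, value)
```

Because the result is a plain list, several outliers with similar values stay readable, which is the situation where plot labels overlap.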
2) The procedure for detecting outliers depends on whether data is
continuous or categorical. If it is continuous, this means screening
the sample as a whole; if categorical, this means screening by group. I am
using analyses that will involve both continuous and categorical
data, so how should I screen my data? I’ve been screening as a whole up
until now. Besides, if I decided to screen using groups, at what point do I
decide not to split the data into groups, i.e., I could split my data
according to gender, age, ethnicity, education, occupation, country etc
etc.
3) Related to the above Q, does the idea that screening for outliers
depends on whether data is continuous or categorical apply to all data
screening procedures, e.g., normality analyses?
4) I have screened my data according to subscales rather than full scale
scores, e.g., I checked the normality of each individual subscale on each
questionnaire (some questionnaires don’t produce full scale scores). I
don’t know whether this is standard practice, but to me it makes sense to
screen by subscale. I do, however, have a variable that does produce
subscales, but I have had to use the full scale score in my data screening
because I can’t split it into subscales until I’ve factor analysed it. Is
this OK?
5) I’ve been using logarithm & square root transformations etc. to reduce
skew and kurtosis, but these transformations don’t appear to be effective
in improving normality when there is only high or low kurtosis (i.e., when
skew is OK). Any suggestions?
6) In some cases I’ve transformed a variable to reduce skew so it is less
than 1, but this has sometimes also inflated kurtosis by about 0.6. Is it
best to have a variable with a skew level of about 1.3 and kurtosis 0.15,
or a variable with skew at .6 and kurtosis about .6?
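To make comparisons like the ones in Qs 5 and 6 concrete, skew and kurtosis can be tabulated side by side for each candidate transformation. The following is a minimal Python sketch under stated assumptions: the data are hypothetical, the helper name `skew_kurtosis` is made up for illustration, and it uses the simple moment-based formulas, so the values will differ somewhat from the small-sample-adjusted statistics SPSS reports.

```python
import math

def skew_kurtosis(xs):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical positively skewed scores (all > 0, so log is defined)
raw = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 15, 30]
candidates = {
    "raw": raw,
    "sqrt": [math.sqrt(x) for x in raw],
    "log10": [math.log10(x) for x in raw],
}
for name, values in candidates.items():
    skew, kurt = skew_kurtosis(values)
    print(f"{name:>5}: skew={skew:.2f}, kurtosis={kurt:.2f}")
```

Log and square root transformations mainly pull in a long tail, so a table like this tends to show skew changing much more than kurtosis for a variable that is already roughly symmetric.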
7) When computing Mahalanobis distance I am assuming that I move “all” of my
variables into the “independents” box. I have about 40 variables because,
as well as some full-scale variables, I also have some variables that
assess subscales and do not combine to produce a full-scale score.
However, what about the variable for which I only have full-scale scores
(because I’ve not yet factor analysed the scale)? Am I OK to simply put
this variable across even though I will at some point be breaking the scale
down into subscales?
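For reference, the quantity in question can be sketched outside SPSS. The code below is a minimal pure-Python illustration of the squared Mahalanobis distance of each case from the centroid of whichever variables are entered. The function name `mahalanobis_sq` and the two-variable data set are hypothetical, and this is not a reproduction of SPSS's own routine.

```python
def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the sample centroid.

    `rows` is a list of equal-length numeric lists (one per case); the
    covariance matrix uses the n-1 denominator.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    dev = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(d[i] * d[j] for d in dev) / (n - 1) for j in range(p)]
           for i in range(p)]

    def solve(a, b):
        """Solve a @ x = b by Gaussian elimination with partial pivoting."""
        m = [row[:] + [bi] for row, bi in zip(a, b)]
        for col in range(p):
            pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
            if abs(m[pivot][col]) < 1e-12:
                raise ValueError("covariance matrix is singular")
            m[col], m[pivot] = m[pivot], m[col]
            for r in range(col + 1, p):
                f = m[r][col] / m[col][col]
                for c in range(col, p + 1):
                    m[r][c] -= f * m[col][c]
        x = [0.0] * p
        for i in reversed(range(p)):
            tail = sum(m[i][c] * x[c] for c in range(i + 1, p))
            x[i] = (m[i][p] - tail) / m[i][i]
        return x

    # d' * inverse(cov) * d, computed via a linear solve instead of an explicit inverse
    return [sum(di * xi for di, xi in zip(d, solve(cov, d))) for d in dev]

# Hypothetical data: six cases measured on two variables
cases = [[2, 3], [3, 4], [4, 5], [5, 7], [6, 7], [9, 2]]
for i, d2 in enumerate(mahalanobis_sq(cases), start=1):
    print(f"case {i}: D^2 = {d2:.2f}")
```

The sketch raises an error when the covariance matrix is singular, which is what happens if one entered variable is an exact linear combination of others, for example a full-scale score entered together with all of the subscales that sum to it.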
"THANK YOU" to anyone who has taken the time to read this email. I really
do appreciate the help.
Kathryn
