Date: Wed, 28 Sep 2005 10:32:19 -0400
Reply-To: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>
Subject: data screening help
Content-Type: text/plain; charset=ISO-8859-1
I have a number of questions relating to data screening (i.e., outlier,
normality, linearity checks) that I am hoping people can help me out with,
as I have literally exhausted all my resources. Some of the questions are
practical, others technical. I realize that there are quite a few Qs here,
but a simple "yes" / "no" response (where possible) will be more than
enough! Or if people can only answer one or some of the Qs, that will be
just as helpful. I'd be really grateful for any help at all.
1) Practical question: I've been trying to figure out how to make my
boxplots so that I can actually see the case/ID numbers next to the
outliers. It's OK when I have one outlier, but when I have a bunch of them
they end up on top of each other and I can't see what ID/Case no. they are.
Is there another way SPSS can show me the case numbers of the outliers, or
a way I can visually inspect the case numbers on the boxplots? I've tried
blowing the boxplot up to full screen size, but even this doesn't help.
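To make the request concrete, the listing I am after could be sketched
outside SPSS roughly as below. This is only a minimal Python sketch, not
SPSS syntax. It assumes the common boxplot rule of flagging values more
than 1.5 times the interquartile range beyond the quartiles; SPSS draws its
boxplots from Tukey's hinges, so its cut-offs may differ slightly. The
function name `boxplot_outliers` and the scores are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values_by_id, k=1.5):
    """Return (case_id, value) pairs lying more than k * IQR beyond the quartiles."""
    values = list(values_by_id.values())
    # Quartiles by linear interpolation; an assumption, not SPSS's hinge method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sorted(
        (case_id, value)
        for case_id, value in values_by_id.items()
        if value < lower or value > upper
    )


# Hypothetical subscale scores keyed by case ID.
scores = {101: 12, 102: 14, 103: 13, 104: 15, 105: 14,
          106: 13, 107: 40, 108: 41, 109: 12, 110: -20}

outliers = boxplot_outliers(scores)
for case_id, value in outliers:
    print(f"case {case_id}: {value}")
```

A plain list like this avoids the overlapping labels, because each flagged
case is printed on its own line with its ID.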
2) The procedure for detecting outliers depends on whether data is
continuous or categorical. If it is continuous, this means screening the
sample as a whole; if categorical, this means screening by group. I am
using analyses that will involve both continuous and categorical data, so
how should I screen my data? I've been screening as a whole up until now.
Besides, if I decided to screen using groups, at what point do I decide
not to split the data into groups? I could split my data according to
gender, age, ethnicity, education, occupation, country, etc.
3) Related to the above Q, does the idea that screening for outliers
depends on whether data is continuous or categorical apply to all data
screening procedures, e.g., normality analyses?
4) I have screened my data according to subscales rather than full scale
scores, e.g., checked the normality of each individual subscale on each
questionnaire (some questionnaires don't produce full scale scores). I
don't know whether this is standard practice, but to me it makes sense to
screen by subscale. I do, however, have a variable that does produce
subscales, but I have had to use the full scale score in my data screening
because I can't split it into subscales until I've factor analysed it. Is
5) I've been using logarithm & square root transformations etc. to reduce
skew and kurtosis, but these transformations don't appear to be effective
in improving normality when there is only high or low kurtosis (i.e., when
skew is OK). Any suggestions?
6) In some cases I've transformed a variable to reduce skew so it is less
than 1, but this has sometimes also inflated kurtosis by about 0.6. Is it
best to have a variable with a skew level of about 1.3 and kurtosis 0.15,
or a variable with skew at .6 and kurtosis about .6?
7) When computing Mahalanobis distance I am assuming that I move "all" of
my variables into the "independents" box. I have about 40 variables because
as well as some full-scale variables, I also have some variables that
assess subscales and do not combine to produce a full-scale score.
However, what about the variable for which I only have full scale scores
(because I've not yet factor analysed the scale)? Am I OK to simply put
this variable across even though I will at some point be breaking the scale
down into subscales?
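For reference, the quantity in question is the squared Mahalanobis distance
of each case from the centroid of all the variables entered,
D^2 = (x - mean)' S^-1 (x - mean), where S is the sample covariance matrix.
A minimal Python sketch of that calculation follows. It is not SPSS's own
code; the helper names `solve` and `mahalanobis_sq` and the two-variable
data are hypothetical, and it assumes complete cases and a non-singular
covariance matrix.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # d' S^-1 d, computed by solving S x = d instead of inverting S.
    return [sum(d_i * x_i for d_i, x_i in zip(d, solve(cov, d)))
            for d in centered]


# Hypothetical data: six cases measured on two variables.
cases = [(2, 3), (3, 4), (4, 5), (5, 6.5), (6, 7), (3, 9)]
d2 = mahalanobis_sq(cases)
for case, dist in zip(cases, d2):
    print(case, round(dist, 3))
```

Because every variable entered contributes to S, the distance for a given
case changes depending on which set of variables is moved across, which is
why the choice between a full-scale score and its subscales matters here.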
"THANK YOU" to anyone who has taken the time to read this e-mail. I really
do appreciate the help.