|Date: ||Thu, 7 Jul 2005 17:54:00 -0400|
|Reply-To: ||Richard Ristow <firstname.lastname@example.org>|
|Sender: ||"SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>|
|From: ||Richard Ristow <email@example.com>|
|Subject: ||Re: Importing data: Problems with data structure from Excel file|
|Content-Type: ||text/plain; charset="us-ascii"; format=flowed|
At 03:31 PM 7/7/2005, Rob wrote:
>[I'm importing] data from an Excel file. The dataset includes patient
>data (weight, blood pressure etc.) measured at 10 different time
>Unfortunately the dataset has not been entered in "SPSS mode", but in
>such way that each patient fills out 10 rows (representing the 10 time
>points) with each variable accordingly ordered in its own column.
>Further, one column indicates the patient's id number (thus repeated
>10 times). I would of course like each line to represent one patient
>with all (time) variables separated in they own columns.
>Is there anyway in SPSS (or in Excel for that matter) to avoid
>manually re-entering all the data ??
John Norton's solution is right on: read the data into SPSS as it is,
and there are a number of ways to restructure it. (By the way, I hope
to goodness that each row gives the time that row's data was taken. And
that you've thought what to do if, say, a patient's third row is
Don't assume, though, that (the appropriate) "SPSS mode" is "each line
[case] representing one patient with all (time) variables separated in
their own columns [variables]". The question's been discussed a lot on
this list. "Each line representing [all data for] one patient," is now
being called 'wide' data organization. The organization in your Excel
spreadsheet (or, the same transferred to SPSS) is called 'long'
The developing consensus (I'm one of those arguing for it) is that
'long' is usually better. In brief,
- Fewer variables are always easier to keep track of and manage. (More
records - 'cases' in SPSS - add very little complexity.)
- In particular, data transformations are much easier. In 'long'
structure, converting weight from pounds to kg., or calculating BMI
(body mass index), are one COMPUTE statement each. In 'wide' structure,
they're a LOOP or DO REPEAT at best, and ten new variables to name.
- Data oddities, such as missing data at occasional time points, are
much easier to handle: you leave out a record as a whole, rather than
using logic to decide which variables you skip in creating the 'wide'
- As John showed, it's very easy to go to 'wide' organization for
analysis, as you need to. You rarely need to transform the whole data
set to 'wide'; usually, it's a subset of the variables, and a subset of
the time periods.
- For many analyses you'll use summary statistics: maximum blood
pressure, difference of highest and lowest observed weight. Those are
generally easier to calculate across SPSS 'cases' (records) than within
them: hurray for AGGREGATE.
(Alas, some useful summary measures, like within-patient regression
slope for weight, are very difficult with either organization, though I
may be missing something.)
Hmmmm. That wasn't all that 'brief'.
It may sound academic, but in database design theory, the 'wide'
organization is badly un-normalized. You don't have to make database
design theorists happy, but they advocate normalization for a lot of
good reasons, including those I've mentioned.