Date: Thu, 7 Dec 2000 18:59:07 -0600
Reply-To: Carol Albright <syzygy@TC.UMN.EDU>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Carol Albright <syzygy@TC.UMN.EDU>
Subject: Re: Merging two files
In-Reply-To: <3.0.1.32.20001207175114.007c9540@email.uah.edu>
Content-Type: text/enriched; charset="us-ascii"
Hi, Amy!
First you need to restructure your many-to-one file (lots of records per
patient) into a one-to-one (one record but more variables) file. This
was just discussed last week so it's fresh in my bean. I'm assuming that
the order in which the DX & DVX codes land within patient does not matter
(ie you're not trying to get code 1234 to be stored in DVX3). Then once
you've done that it's a simple MATCH FILES problem. Test this out with a
subsample before putting your computer to work (millions of records,
ug!).
Try playing with this syntax to create the new "D" file (save your
master D file before doing this as it adds new temporary variables):
<fontfamily><param>Courier New</param>*******************.
* Here's your data to use as an example.
*******************.
DATA LIST LIST /seq dx dxv.
BEGIN DATA
001 25002 0
001 4892 1
001 418 0
001 26093 0
001 25091 0
002 2491 1
003 4928 0
003 250 1
003 5924 0
004 3976 1
END DATA.
LIST.
</fontfamily>
* Be sure and sort your data by SEQ and any other sequencing you need.
sort cases by SEQ (A).
<fontfamily><param>Courier New</param>* Next, number every record within
each patient; "seq" is the unique patient ID.
* This number will be used to name the variables.
compute #= ((seq=LAG(seq))* #) + 1.
compute recno=#.
freq recno.
* Check the frequency count of RECNO and decide how many instances of .
* secondary diagnoses you want to keep.
* Let's say the maximum number of 2nd DX is 5 and you want to keep
all.
* If you have more than two variables, then modify the VECTOR command
(list the new.
* variable name stems here) and add additional COMPUTE lines.
* This creates 2 sets of variables, named dx2_1 to dx2_5 and dxv2_1 to
dx2_5.
vector dx2_ dxv2_ (5).
compute dx2_(recno) = dx.
compute dxv2_(recno) = dxv.
execute.
* The AGGREGATE command saves just the records with the complete data.
* Be sure to change the filename and directory to what you need.
* Also add additional lines for each additional variable created (and
change the .
* names & #'s obviously.
aggregate outfile = "D_onerec.sav"
/presorted
/break = seq
/dx2_1 to dx2_5 = last(dx2_1 to dx2_5)
/dxv2_1 to dxv2_5 = last(dxv2_1 to dxv2_5).
</fontfamily>
You've just done the tough part -- now just merge the two files, using
MATCH FILES. Open your ABC.sav file and sort it by SEQ if it isn't
already. I'm assuming that ABC is the "master" file and that not
everyone has a D record (in the D_onerec.sav file).
<fontfamily><param>Courier New</param>MATCH FILES /FILE=*
/TABLE='D_onerec.sav'
/BY seq.
EXECUTE.
</fontfamily>
The new variables will not have variable labels, so you'll need to add
those. If you have gobs and it's time consuming, Raynald posted a script
for generating labels yesterday that you could adapt.
HTH
Carol
At 05:51 PM 12/7/00 -0600, Amy Shadoin wrote:
>Dear Listmembers,
> I am trying to merge two files of unequal dimensions and am
hoping someone
>can help me write appropriate syntax or script to do it in a
>resource-efficient manner. There are literally millions of these
records
>per file!
> My first file, ABC.sav contains individual discharge records, 1
record per
>discharge. Each record has a unique identifier (SEQ). Most of the
variables
>are truly numeric. However, DX1 is a string variable with numbers that
>represent ICD9 diagnosis category names. It is the primary ICD9
diagnosis
>at discharge.
>
>The ABC.sav file is set up as follows:
>
>SEQ AGE SEX DCCHPR1 DRG LOS TOTCHG DX1 DXV1
>001 45 1 250 102 5 10522.85
25002 Y
>002 73 0 263 378 8 38746.73
26001 Y
>003 57 1 294 259 4 24685.03
29042 N
>etc.
>
>
>The second file contains variable numbers of records per individual.
It
>contains the individual discharge record identifier SEQ and two other
>variables per record indicating each diagnosis category applicable at
>discharge. It is set up as follows:
>
>SEQ DX DXV
>001 25002 0
>001 4892 1
>001 418 0
>001 26093 0
>001 25091 0
>002 2491 1
>003 4928 0
>003 250 1
>003 5924 0
>004 3976 1
>etc.
>
>I need to develop syntax or a script that creates an ABCD.sav file with
the
>following constraints:
> 1) contains all the variables in ABC.sav and D.sav for each case
as
>identified by SEQ on a single record
> 2) puts all of the DX and DXV variables (all listed diagnoses)
from D.sav
>at the end of each record in some format such as DX21 DX22 DX23 DX24
DXV21
>DXV22 DXV23 etc. so they are not confused with the DX1 and DXV1
variables
>(primary discharge diagnosis info) in the ABC.sav file.
>
>What I need is a file that looks something like this:
>
>SEQ AGE SEX DCCHPR1 DRG LOS TOTCHG DX1 DXV1 DX21 DX22
etc
>001 45 1 250 102 5 10522.85 25002 Y 25002
4892
>002 73 0 263 378 8 38746.73 26001 Y 2491 .
>003 57 1 294 259 4 24685.03 29042 N 4928 250
>
>I'd like do as little variable renaming in the ABC.sav file as
possible.
>Can anyone help?
>
>Thanks in advance!
>
>Amy Shadoin
>Amy L. Shadoin, Ph.D.
>Assistant Professor,
>College of Nursing
>University of Alabama in Huntsville
>Huntsville, AL 35899
>
>Phone: 256-824-2428
>Fax: 256-824-6026
>
>
-------------------------------------------------------------------------
Carol L. Albright, MS | E-Mail : syzygy@tc.umn.edu
Albright Consulting | Phone : 651/699-7218
2031 Goodrich Avenue | http://www.tc.umn.edu/~syzygy
St. Paul, MN 55105 USA | Research data services
-------------------------------------------------------------------------