LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2000)Back to main SPSSX-L pageJoin or leave SPSSX-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 7 Dec 2000 18:59:07 -0600
Reply-To:     Carol Albright <syzygy@TC.UMN.EDU>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Carol Albright <syzygy@TC.UMN.EDU>
Subject:      Re: Merging two files
Comments: To: Amy Shadoin <shadoina@EMAIL.UAH.EDU>
In-Reply-To:  <3.0.1.32.20001207175114.007c9540@email.uah.edu>
Content-Type: text/enriched; charset="us-ascii"

Hi, Amy!

First you need to restructure your many-to-one file (lots of records per patient) into a one-to-one (one record but more variables) file. This was just discussed last week so it's fresh in my bean. I'm assuming that the order in which the DX & DVX codes land within patient does not matter (ie you're not trying to get code 1234 to be stored in DVX3). Then once you've done that it's a simple MATCH FILES problem. Test this out with a subsample before putting your computer to work (millions of records, ug!).

Try playing with this syntax to create the new "D" file (save your master D file before doing this as it adds new temporary variables):

<fontfamily><param>Courier New</param>*******************.

* Here's your data to use as an example.

*******************.

DATA LIST LIST /seq dx dxv.

BEGIN DATA

001 25002 0

001 4892 1

001 418 0

001 26093 0

001 25091 0

002 2491 1

003 4928 0

003 250 1

003 5924 0

004 3976 1

END DATA.

LIST.

</fontfamily>

* Be sure and sort your data by SEQ and any other sequencing you need.

sort cases by SEQ (A).

<fontfamily><param>Courier New</param>* Next, number every record within each patient; "seq" is the unique patient ID.

* This number will be used to name the variables.

compute #= ((seq=LAG(seq))* #) + 1.

compute recno=#.

freq recno.

* Check the frequency count of RECNO and decide how many instances of .

* secondary diagnoses you want to keep.

* Let's say the maximum number of 2nd DX is 5 and you want to keep all.

* If you have more than two variables, then modify the VECTOR command (list the new.

* variable name stems here) and add additional COMPUTE lines.

* This creates 2 sets of variables, named dx2_1 to dx2_5 and dxv2_1 to dx2_5.

vector dx2_ dxv2_ (5).

compute dx2_(recno) = dx.

compute dxv2_(recno) = dxv.

execute.

* The AGGREGATE command saves just the records with the complete data.

* Be sure to change the filename and directory to what you need.

* Also add additional lines for each additional variable created (and change the .

* names & #'s obviously.

aggregate outfile = "D_onerec.sav"

/presorted

/break = seq

/dx2_1 to dx2_5 = last(dx2_1 to dx2_5)

/dxv2_1 to dxv2_5 = last(dxv2_1 to dxv2_5).

</fontfamily>

You've just done the tough part -- now just merge the two files, using MATCH FILES. Open your ABC.sav file and sort it by SEQ if it isn't already. I'm assuming that ABC is the "master" file and that not everyone has a D record (in the D_onerec.sav file).

<fontfamily><param>Courier New</param>MATCH FILES /FILE=*

/TABLE='D_onerec.sav'

/BY seq.

EXECUTE.

</fontfamily>

The new variables will not have variable labels, so you'll need to add those. If you have gobs and it's time consuming, Raynald posted a script for generating labels yesterday that you could adapt.

HTH

Carol

At 05:51 PM 12/7/00 -0600, Amy Shadoin wrote:

>Dear Listmembers,

> I am trying to merge two files of unequal dimensions and am hoping someone

>can help me write appropriate syntax or script to do it in a

>resource-efficient manner. There are literally millions of these records

>per file!

> My first file, ABC.sav contains individual discharge records, 1 record per

>discharge. Each record has a unique identifier (SEQ). Most of the variables

>are truly numeric. However, DX1 is a string variable with numbers that

>represent ICD9 diagnosis category names. It is the primary ICD9 diagnosis

>at discharge.

>

>The ABC.sav file is set up as follows:

>

>SEQ AGE SEX DCCHPR1 DRG LOS TOTCHG DX1 DXV1

>001 45 1 250 102 5 10522.85 25002 Y

>002 73 0 263 378 8 38746.73 26001 Y

>003 57 1 294 259 4 24685.03 29042 N

>etc.

>

>

>The second file contains variable numbers of records per individual. It

>contains the individual discharge record identifier SEQ and two other

>variables per record indicating each diagnosis category applicable at

>discharge. It is set up as follows:

>

>SEQ DX DXV

>001 25002 0

>001 4892 1

>001 418 0

>001 26093 0

>001 25091 0

>002 2491 1

>003 4928 0

>003 250 1

>003 5924 0

>004 3976 1

>etc.

>

>I need to develop syntax or a script that creates an ABCD.sav file with the

>following constraints:

> 1) contains all the variables in ABC.sav and D.sav for each case as

>identified by SEQ on a single record

> 2) puts all of the DX and DXV variables (all listed diagnoses) from D.sav

>at the end of each record in some format such as DX21 DX22 DX23 DX24 DXV21

>DXV22 DXV23 etc. so they are not confused with the DX1 and DXV1 variables

>(primary discharge diagnosis info) in the ABC.sav file.

>

>What I need is a file that looks something like this:

>

>SEQ AGE SEX DCCHPR1 DRG LOS TOTCHG DX1 DXV1 DX21 DX22 etc

>001 45 1 250 102 5 10522.85 25002 Y 25002 4892

>002 73 0 263 378 8 38746.73 26001 Y 2491 .

>003 57 1 294 259 4 24685.03 29042 N 4928 250

>

>I'd like do as little variable renaming in the ABC.sav file as possible.

>Can anyone help?

>

>Thanks in advance!

>

>Amy Shadoin

>Amy L. Shadoin, Ph.D.

>Assistant Professor,

>College of Nursing

>University of Alabama in Huntsville

>Huntsville, AL 35899

>

>Phone: 256-824-2428

>Fax: 256-824-6026

>

>

-------------------------------------------------------------------------

Carol L. Albright, MS | E-Mail : syzygy@tc.umn.edu

Albright Consulting | Phone : 651/699-7218

2031 Goodrich Avenue | http://www.tc.umn.edu/~syzygy

St. Paul, MN 55105 USA | Research data services

-------------------------------------------------------------------------


Back to: Top of message | Previous page | Main SPSSX-L page