Date: Tue, 14 Sep 2010 04:00:53 -0400
Reply-To: David Marso <david.marso@GMAIL.COM>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: David Marso <david.marso@GMAIL.COM>
Subject: Re: Implementation of AGGREGATE (was, re: Large Data Files)
Content-Type: text/plain; charset=ISO-8859-1
On Mon, 13 Sep 2010 23:07:24 -0400, Richard Ristow <firstname.lastname@example.org>
>At 03:03 AM 9/12/2010, David Marso wrote:
>>If your data are "clumpy" wrt the break variables (i.e you have LOTS
>>of the same values contiguous in the file) you can do the following
>>"trick" (NO PRESORT REQUIRED).
>>AGGREGATE OUTFILE * / PRESORTED / BREAK breaks
>> / Summary functions......
>> / N=N.
>>SORT CASES BY breaks.
>>WEIGHT BY N.
>>AGGREGATE OUTFILE * / PRESORTED .....
>>One might think SPSS would throw a screaming hissy fit about an
>>unsorted file BUT NO ;-) ... It rips through and aggregates all of
>>the contiguous break values. Then you have a disordered but
>>summarized file which should be much smaller that the raw file.
>The records with the same break values don't even need to be
>contiguous; my recent post "Re: sorting out a nested data structure"
>(Thu, 9 Sep 2010 01:18:32 -0400) uses that.
This is a partial of the file as viewed in FireFox and logged into:
We can, using concatenation and duplicate functions, identify
duplicate children across agencies (e.g. with different family ids) �
this unique child id is a string variable:<br>
2. Is there an �assign� function that
will automate assigning unique ids to children using the string variable
we have constructed?</blockquote><br>
I'm missing what you need, here. Do you already have "a unique child
id, [which] is a string variable", or do you need to construct
If the latter, given that you can identify duplicate children (I take it,
that means you can recognize when two records represent the same child,
even though the records are from different agencies), then if you sort
the data so all records for each child are together in the file (I
presume that's possible), and you have a way (with LAG or something) to
determine when a record represents the same child as its predecessor,
than something like (untested)<br><br>
<tt><font size=2>NUMERIC Our_Child_ID(F6).<br>
DO IF $CASENUM EQ 1.<br>
. COMPUTE Our_Child_ID = 1.<br>
ELSE IF NOT <same child as previous record>.<br>
. COMPUTE Our_Child_ID = Our_Child_ID + 1.<br>
That assigns a <i>numeric</i> ID, which is easier to calculate. You can
convert it to a string using the STRING function, but I'd see no need to,
unless I'm misunderstanding your needs.<br><br>
<blockquote type=cite class=cite cite="">
1. Because the dataset is so large,
manually combing thru the duplicates to assign our own unique family or
child identifiers is not practical.</blockquote><br>
For child identifiers, see above. <br><br>
To construct family identifiers, I'd start with the view that you have a
single family identifying key, consisting of an agency ID and <i>that
agency's</i> family identifier. Your problem then is, a family may be in
the file under several keys, and you want to recognize which (different)
keys refer to the same family.<br><br>
<blockquote type=cite class=cite cite="">
3. Once we figure out how to assign
unique child identifiers we are still faced with the problem of finding
some automated way of grouping all the children within families so each
family has a unique identifier, regardless of how many individual
agencies/programs are providing services to the children w/i that
family. Since the family is our unit of analysis this is
critical. Any suggestions?</blockquote><br>
And the resulting file
>IS sorted by the break groups, unless you use MODE=ADDVARIABLES which
>effectively de-sorts it again.
You are *NOT* using a PRESORTED subcommand in your code so of course it will
come back ordered. I don't believe you processed the subtlety of my
posting. I was describing a situation where your HUGE data file is LARGELY
"grouped" but are NOT sorted specifically by these groups.
by ***NOT*** sorting the file but indicating ***PRESORTED*** one ends up
with a partially aggregated file which is out of order and has multiple
records of a given GROUP. By WEIGHTING and reaggregating, one can obtain a
final aggregated file. I suspect that the implementation of AGGREGATE has
changed significantly. Back in the good old days (if I'm not mistaken) it
used to sort the file internally and this presorted trick worked wonders to
make the file to be sorted smaller in cases where blocks of cases are
contiguous. It is really a function of the data. If your data are NOT
grouped then my "trick" will probably hurt performance. OTOH, it is good to
know that plain old AGGREGATE without a dedicated SORT will run with
To verify my proposal, one would need to construct various HUGE files with
differing characteristics wrt "Clumping", number of distinct key
combinations and number of variables in the aggregate functions.
I don't have the time to spend on it and rarely encounter these sort of
files, so it's really not worth my time to pursue.
>As I understand from Jon Peck's explanations (*), the default
>behavior of AGGREGATE is to set up a hash table with an entry for
>each set of BREAK values encountered, and accumulate aggregated
>values in those table records. So, AGGREGATE can always run without
>pre-sorting the file.
I wonder what the *NONDEFAULT* behavior is (ie if one specifies /PRESORTED)
I'll bet it doesn't create the hash table (why would it?). It probably just
rolls along and spits out cases as it hits a new set of break values..
OTOH, I search online for SPSS agorithms and they don't seem to be easy to
find if at all.
So, think about it Richard, If I tell it the file is PRESORTED but don't
sort the file it should bypass creating the hash table. I'll end up with a
much smaller file which I can then further work with.
Consider a HUGE file consisting of millions of cases where the file is 50
state files concatenated together but are not sorted by STATE.
Rather than sorting you can aggregate /PRESORTED ...
I suspect that the current version of AGGREGATE simply is implemented in a
smart way where the previous incarnation was NOT and us intrepid ones broke
the 'rules' to get the job done ;-).
>However, if there are very many break groups (like hundreds of
>thousands), the hash table may fill up memory and start paging, and
>performance slows by what may be orders of magnitude. When that
>happens, pre-sorting and specifying /PRESORTED on AGGREGATE will run
>(*)Date: Mon, 13 Sep 2004 08:01:41 -0500
>From: "Peck, Jon" <email@example.com>
>Subject: Re: factors influencing speed
> and, off-list,
>Subject: Re: [SPSSX-L] SORT CASES algorithm
>X-MimeOLE: Produced By Microsoft Exchange V6.0.6556.0
>Date: Fri, 15 Oct 2004 07:26:39 -0500
>Thread-Topic: Re: [SPSSX-L] SORT CASES algorithm
>From: "Peck, Jon" <firstname.lastname@example.org>
>To: "Raynald Levesque"
>X-OriginalArrivalTime: 15 Oct 2004 12:26:39.0726 (UTC)
>Sorting before running AGGREGATE is in most cases NOT a performance
>improvement. AGGREGATE doesn't need it except in situations where
>there are very many distinct break values. Most of the time,
>AGGREGATE without the /PRESORTED subcommand will be faster than
>sorting followed by AGGREGATE with presort.
>To manage your subscription to SPSSX-L, send a message to
>LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>For a list of commands to manage subscriptions, send the command
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
For a list of commands to manage subscriptions, send the command