LISTSERV at the University of Georgia
Date:         Tue, 14 Sep 2010 04:00:53 -0400
Reply-To:     David Marso <david.marso@GMAIL.COM>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         David Marso <david.marso@GMAIL.COM>
Subject:      Re: Implementation of AGGREGATE (was, re: Large Data Files)
Comments: To: wrristow@MINDSPRING.COM
Content-Type: text/plain; charset=ISO-8859-1

On Mon, 13 Sep 2010 23:07:24 -0400, Richard Ristow <wrristow@mindspring.com> wrote:

>At 03:03 AM 9/12/2010, David Marso wrote:
>
>>If your data are "clumpy" wrt the break variables (i.e. you have LOTS
>>of the same values contiguous in the file) you can do the following
>>"trick" (NO PRESORT REQUIRED).
>>
>>AGGREGATE OUTFILE * / PRESORTED / BREAK breaks
>>  / Summary functions......
>>  / N=N.
>>SORT CASES BY breaks.
>>WEIGHT BY N.
>>AGGREGATE OUTFILE * / PRESORTED .....
>>
>>One might think SPSS would throw a screaming hissy fit about an
>>unsorted file BUT NO ;-) ... It rips through and aggregates all of
>>the contiguous break values. Then you have a disordered but
>>summarized file which should be much smaller than the raw file.
>
>The records with the same break values don't even need to be
>contiguous; my recent post "Re: sorting out a nested data structure"
>(Thu, 9 Sep 2010 01:18:32 -0400) uses that.

This is a partial copy of that post, as viewed in Firefox when logged into http://listserv.uga.edu/cgi-bin/wa :
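The two-pass trick quoted above can be sketched in Python. This is an illustrative stand-in for the SPSS behaviour, not SPSS itself: the first pass collapses only contiguous runs of break values (what /PRESORTED does on clumpy-but-unsorted data), and the second pass sorts the much smaller partial file and merges the weighted partials.

```python
from itertools import groupby

def pass_one(rows):
    """Emulate AGGREGATE /PRESORTED on unsorted-but-clumpy data:
    collapse each contiguous run of identical break values into
    one partial record carrying a sum and a count (N)."""
    out = []
    for key, run in groupby(rows, key=lambda r: r[0]):
        vals = [r[1] for r in run]
        out.append((key, sum(vals), len(vals)))  # break, SUM, N
    return out

def pass_two(partials):
    """SORT CASES BY breaks, then re-aggregate: merge the partial
    records for each break value into the final totals. Sums (and
    counts) combine directly; that is why the partial file can be
    re-aggregated without losing information."""
    totals = {}
    for key, s, n in sorted(partials):
        tot_s, tot_n = totals.get(key, (0, 0))
        totals[key] = (tot_s + s, tot_n + n)
    return totals

# Clumpy but unsorted: long runs of the same break value.
rows = [("GA", 1), ("GA", 2), ("FL", 5), ("FL", 5), ("GA", 3)]
print(pass_two(pass_one(rows)))  # {'FL': (10, 2), 'GA': (6, 3)}
```

Only sums and counts are carried here; a weighted mean is what the quoted WEIGHT BY N step buys you in real SPSS.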

>We can, using concatenation and duplicate functions, identify duplicate
>children across agencies (e.g. with different family ids) -- this unique
>child id is a string variable:
>
>2.  Is there an "assign" function that will automate assigning unique
>ids to children using the string variable we have constructed?

I'm missing what you need, here. Do you already have "a unique child id, [which] is a string variable", or do you need to construct one?

If the latter, given that you can identify duplicate children (I take it, that means you can recognize when two records represent the same child, even though the records are from different agencies), then if you sort the data so all records for each child are together in the file (I presume that's possible), and you have a way (with LAG or something) to determine when a record represents the same child as its predecessor, then something like (untested)

NUMERIC Our_Child_ID(F6).
LEAVE   Our_Child_ID.
DO IF   $CASENUM EQ 1.
.  COMPUTE Our_Child_ID = 1.
ELSE IF NOT <same child as previous record>.
.  COMPUTE Our_Child_ID = Our_Child_ID + 1.
END IF.

That assigns a *numeric* ID, which is easier to calculate. You can convert it to a string using the STRING function, but I'd see no need to, unless I'm misunderstanding your needs.

>1.  Because the dataset is so large, manually combing thru the
>duplicates to assign our own unique family or child identifiers is not
>practical.

For child identifiers, see above.

To construct family identifiers, I'd start with the view that you have a single family identifying key, consisting of an agency ID and *that agency's* family identifier. Your problem then is, a family may be in the file under several keys, and you want to recognize which (different) keys refer to the same family.

>3.  Once we figure out how to assign unique child identifiers we are
>still faced with the problem of finding some automated way of grouping
>all the children within families so each family has a unique
>identifier, regardless of how many individual agencies/programs are
>providing services to the children w/i that family. Since the family is
>our unit of analysis this is critical. Any suggestions?
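The LEAVE / DO IF pattern in the quoted SPSS code can be mirrored in Python. This is illustrative only; the `same_child` comparison is a hypothetical stand-in for the "<same child as previous record>" test left open in the original.

```python
def assign_child_ids(records, same_child):
    """Walk records sorted so each child's rows are adjacent,
    incrementing a running numeric ID whenever the current record
    is not the same child as its predecessor (the LEAVE + DO IF
    pattern: the ID persists across cases until a new child)."""
    ids = []
    next_id = 0
    prev = None
    for rec in records:
        if prev is None or not same_child(prev, rec):
            next_id += 1
        ids.append(next_id)
        prev = rec
    return ids

# Records already sorted so duplicates are adjacent; the key is a
# hypothetical constructed string ID like the one in the question.
recs = ["smith|2001-03-04", "smith|2001-03-04", "jones|1999-12-01"]
print(assign_child_ids(recs, lambda a, b: a == b))  # [1, 1, 2]
```

As in the SPSS version, the result is a numeric ID; converting it to a string is possible but unnecessary.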

<SNIP>

>And the resulting file IS sorted by the break groups, unless you use
>MODE=ADDVARIABLES which effectively de-sorts it again.
>

You are *NOT* using a PRESORTED subcommand in your code, so of course it will come back ordered. I don't believe you processed the subtlety of my posting. I was describing a situation where your HUGE data file is LARGELY "grouped" but is NOT sorted specifically by these groups. By ***NOT*** sorting the file but indicating ***PRESORTED***, one ends up with a partially aggregated file which is out of order and has multiple records for a given GROUP. By WEIGHTING and reaggregating, one can obtain a final aggregated file.

I suspect that the implementation of AGGREGATE has changed significantly. Back in the good old days (if I'm not mistaken) it used to sort the file internally, and this presorted trick worked wonders to make the file to be sorted smaller in cases where blocks of cases are contiguous. It is really a function of the data: if your data are NOT grouped, then my "trick" will probably hurt performance. OTOH, it is good to know that plain old AGGREGATE without a dedicated SORT will run efficiently.

To verify my proposal, one would need to construct various HUGE files with differing characteristics wrt "clumping", the number of distinct key combinations, and the number of variables in the aggregate functions. I don't have the time to spend on it and rarely encounter these sorts of files, so it's really not worth my time to pursue.

>As I understand from Jon Peck's explanations (*), the default >behavior of AGGREGATE is to set up a hash table with an entry for >each set of BREAK values encountered, and accumulate aggregated >values in those table records. So, AGGREGATE can always run without >pre-sorting the file.
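The default behaviour described in that quote can be sketched with a dict. This is my Python approximation of a hash-table aggregator, not the actual SPSS implementation: no ordering is assumed, and each distinct break combination gets exactly one accumulator entry.

```python
def hash_aggregate(rows):
    """One pass over unsorted rows: a dict keyed by the break
    value accumulates (sum, n) per group, so no presort is
    needed -- at the cost of holding an entry for every
    distinct break value in memory at once."""
    table = {}
    for key, value in rows:
        s, n = table.get(key, (0, 0))
        table[key] = (s + value, n + 1)
    return table

rows = [("b", 2), ("a", 1), ("b", 4), ("a", 3)]
print(hash_aggregate(rows))  # {'b': (6, 2), 'a': (4, 2)}
```

The memory cost of that table is exactly what makes the many-break-groups case slow, as the follow-up paragraph from Jon Peck notes.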

I wonder what the *NONDEFAULT* behavior is (i.e. if one specifies /PRESORTED). I'll bet it doesn't create the hash table (why would it?). It probably just rolls along and spits out cases as it hits a new set of break values. OTOH, I searched online for the SPSS algorithms and they don't seem to be easy to find, if at all.

So, think about it, Richard: if I tell it the file is PRESORTED but don't sort the file, it should bypass creating the hash table. I'll end up with a much smaller file which I can then further work with. Consider a HUGE file consisting of millions of cases, where the file is 50 state files concatenated together but not sorted by STATE. Rather than sorting, you can aggregate /PRESORTED ... I suspect that the current version of AGGREGATE is simply implemented in a smart way where the previous incarnation was NOT, and us intrepid ones broke the 'rules' to get the job done ;-).

>
>However, if there are very many break groups (like hundreds of
>thousands), the hash table may fill up memory and start paging, and
>performance slows by what may be orders of magnitude. When that
>happens, pre-sorting and specifying /PRESORTED on AGGREGATE will run
>much faster.
>======================
>(*)Date: Mon, 13 Sep 2004 08:01:41 -0500
>From: "Peck, Jon" <peck@spss.com>
>Subject: Re: factors influencing speed
>To: SPSSX-L@LISTSERV.UGA.EDU
>
> and, off-list,
>Subject: Re: [SPSSX-L] SORT CASES algorithm
>Date: Fri, 15 Oct 2004 07:26:39 -0500
>From: "Peck, Jon" <peck@spss.com>
>To: "Raynald Levesque" <rlevesque@videotron.ca>, <wrristow@mindspring.com>
>
>Sorting before running AGGREGATE is in most cases NOT a performance
>improvement. AGGREGATE doesn't need it except in situations where
>there are very many distinct break values. Most of the time,
>AGGREGATE without the /PRESORTED subcommand will be faster than
>sorting followed by AGGREGATE with presort.
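The conjectured /PRESORTED path can be sketched as a streaming generator. This is my guess at the behaviour, hedged accordingly, not documented SPSS internals: a finished group is emitted the moment the break value changes, so only one accumulator is ever live.

```python
def presorted_aggregate(rows):
    """Stream rows, flushing an output record each time the break
    value changes; memory use is constant regardless of the
    number of groups, but unsorted input yields one output
    record per contiguous run rather than per group."""
    cur_key, s, n = None, 0, 0
    for key, value in rows:
        if key != cur_key:
            if cur_key is not None:
                yield (cur_key, s, n)
            cur_key, s, n = key, 0, 0
        s += value
        n += 1
    if cur_key is not None:
        yield (cur_key, s, n)

# Unsorted input: each contiguous run becomes its own record --
# the partially aggregated, out-of-order intermediate file that
# the two-pass trick then sorts and re-aggregates.
rows = [("GA", 1), ("GA", 2), ("FL", 5), ("GA", 3)]
print(list(presorted_aggregate(rows)))
# [('GA', 3, 2), ('FL', 5, 1), ('GA', 3, 1)]
```

On truly presorted input this emits exactly one record per group, which matches why /PRESORTED wins when the hash table would otherwise page.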

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

