Date: Tue, 2 Feb 1999 17:06:12 -0600
Reply-To: "Matheson, David" <davidm@SPSS.COM>
Sender: "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU>
From: "Matheson, David" <davidm@SPSS.COM>
Subject: Re: Correlated sampling?
I haven't seen the original posting in the list yet. For 2 variables with a
specified
correlation, the following code should work (it did on my trial).
* suppose that you want y1 and y2 to each be normal(0,1)
* and have a correlation of .6. y1 is generated from normal(0,1).
* If r(y1,y2) = .6, then .6**2 = .36 of the variance of y2 is accounted for
by y1.
* .8**2 = .64 of the variance of y2 is not accounted for by y1.
compute y1 = normal(1).
compute y2 = y1*0.6 + normal(.8).
execute.
* These commands would generate y1 and y2 with SDs very close to 1 and
* a correlation very close to .6. y1 and y2 will not be exactly N(0,1) and
* correlation between the results of the 2 calls of normal(1) will not be
exactly 0,
* so the resulting correlation between y1 and y2 will not be exactly 0. As
the
* generated sample size increases, y1 and y2 should approach these
* specifications more closely.
* To force y1 and y2 to fit the specifications exactly, you can take the
* somewhat longer approach below. The variables fac1_1 and fac2_1
* are saved by FACTOR as orthogonal component scores with means of 0
* and standard deviations of 1.
compute w1 = normal(1).
compute w2 = normal(1).
FACTOR
/VARIABLES w1 w2 /MISSING LISTWISE /ANALYSIS w1 w2
/PRINT INITIAL EXTRACTION
/CRITERIA FACTORS(2) ITERATE(25)
/EXTRACTION PC
/ROTATION NOROTATE
/SAVE REG(ALL)
/METHOD=CORRELATION .
compute y1 = fac1_1.
compute y2 = fac1_1*0.6 + fac2_1*0.8 .
correlations y1 y2 .
A more general solution, for more than 2 variables, is pasted below. Note
that the SPSS Matrix language, which is used in this solution, is available
in the SPSS Base as of version 9. In older desktop versions of SPSS, the
Matrix language was part of the Advanced Statistics module.
As a general caveat re. random number generation, the following is excerpted
from a note by David Nichols, the manager of Statistical Support at SPSS:
>> >SPSS can be used for simulations, but as with just about all
> >> >packages, I'd be careful if you mean to do serious work in
> >> >the area. We use a multiplicative congruential generator with
> >> >fairly good properties; certainly good enough for classroom
> >> >demonstrations and such, but not really suited for heavy duty
> >> >simulation work.
See McCulloch, B.D. (1998). Assessing the reliability of statistical
software: Part I. American Statistician, 52(4), 358-366.
I hope this helps.
David Matheson
SPSS Technical Support
Q.
How can I generate data which are multivariate normal and
have a covariance or correlation matrix that I specify?
A.
In SPSS 4.0 or above on non-DOS platforms, this job can be performed
with the matrix language of SPSS. Users of SPSS/PC+ or SPSSX can do
part of the job but will have to perform some matrix algebra outside
of SPSS. The SPSS 4.0 approach is described first, followed by a
description of the SPSSX approach. The general algorithm is described
in Rubinstein, R. (1981), "Simulation and the Monte Carlo Method",
Wiley.
I SPSS 4.0 Method
The basic strategy is:
1. Before entering matrix, generate the required number of normally-
distributed variables by using compute commands. These variables should
be multivariate normal (MVN) with correlations approaching 0,
i.e. approaching independence.
2. To compute a set of variables which are multivariate normal and
strictly independent, or orthogonal, run the FACTOR procedure
with the variables created in Step 1. Use the default extraction
method of principal components, extracting as many components
as you included in the analysis. Save these components to a
new set of variables with the FACTOR /SAVE subcommand. These
new variables will be MVN and orthogonal with means of 0 and
variances of 1.
3. Enter the matrix language and read the set of standard normal variables
from Step 2 as a matrix, Z for example. Then define and enter the
target covariance matrix, S.
4. Calculate the Cholesky factor for the target covariance matrix.
The Cholesky factor is an upper triangular matrix which is the
"square root" of the covariance matrix. If V is the Cholesky
factor of matrix S, and VT is the transpose of V, then
VT*V=S , where * is the matrix multiplication operator.
5. Post-multiply the independent standard normals, Z, by the Cholesky
factor, V, to give a new data matrix, X. If the target matrix was
a correlation matrix and you want the new variables to have a
standard deviation other than 1, then multiply X by the desired SD.
If you want the new variables to have a nonzero mean, add the desired
mean to X (but not before multiplication by SD).
If the target covariance matrix is not a correlation matrix, the
new variables will have variances equal to their respective
diagonal elements in the target matrix. Their means will be 0.
The desired mean can be added to the variables with a compute
statement, either before or after leaving matrix.
6. Save the X matrix to variables in the SPSS active file and leave
matrix.
An example of the implementation of this algorithm is shown below. The
target matrix is a correlation matrix for the five variables to be
generated. I added some statements to print the determinant and the
condition number of the input covariance matrix to insure that it was
not singular or ill-conditioned. I also printed the product of VT*V to
insure that the target matrix was reproduced. All the new variables
were scaled to have means of 100 and standard deviations of 15 before
leaving matrix.
input program.
+ loop #i = 1 to 10000.
+ do repeat response=r1 to r5 .
+ compute response = normal(1) .
+ end repeat.
+ end case.
+ end loop.
+ end file.
end input program.
correlations r1 to r5 / statistics=descriptives.
* Factor procedure computes pr1 to pr5, which are standard MVN .
factor variables = r1 to r5 / print = default det
/criteria = factors(5) /save=reg (all,pr).
* use matrix to set corr matrix.
* x is a 10,000 by 5 matrix of independent standard normals .
* cor is the target covariance matrix.
* cho is the Cholesky factor of cor .
* newx is the 10,000 by 5 data matrix which has target covariance matrix .
matrix.
get x / variables=pr1 to pr5.
compute cor={1, 0.4, 0.3, 0.2, 0.1 ;
0.4, 1, 0.4, 0.3, 0.2 ;
0.3, 0.4, 1, 0.4, 0.3 ;
0.2, 0.3, 0.4, 1, 0.4 ;
0.1, 0.2, 0.3, 0.4, 1 }.
compute deter=det(cor).
print deter / title "determinant of corr matrix" / format=f10.7 .
print sval(cor) / title "singular value decomposition of corr".
print eval(cor) / title "eigenvalues of input corr".
* In a symmetric matrix sval and eigenvalues are identical - choose 1 .
compute condnum=mmax(sval(cor))/mmin(sval(cor)).
print condnum / title "condition number of corr matrix" / format=f10.2 .
compute cho=chol(cor).
print cho / title "cholesky factor of corr matrix" .
compute chochek=t(cho)*cho.
print chochek / title "chol factor premult by its transpose " /format=f10.2
.
compute newx=x*cho.
compute newx=newx*15 + 100.
save newx /outfile=* /variables= nr1 to nr5.
end matrix.
correlations nr1 to nr5 / statistics=descriptives.
II SPSSX and SPSS/PC+ Method
If the matrix language is not available, users can apply the Cholesky
factor matrix to the standard normal variables with a series of
compute commands. However, the Cholesky factorization must be
calculated outside of SPSS. The formulae for the Cholesky are
provided in Rubinstein's book and in Bock (1975), "Multivariate
Statistical Methods in Behavioral Research", McGraw-Hill.
The example below uses the Cholesky factor for the target matrix in
the previous example. The Cholesky factor matrix is printed below.
One compute statement is required for each new variable. Note that the
values in the columns of the Cholesky matrix act as weights for the
original standard normal variables and these weighted variables are
combined to form the new variables.
Cholesky Matrix for the Target Correlation Matrix.
1.0 0.4 0.3 0.2 0.1
0 0.91652 0.30551 0.24004 0.17457
0 0 0.9037 0.29508 0.23976
0 0 0 0.90294 0.29608
0 0 0 0 0.90243
* SPSS-X, SPSS or SPSS/PC+ program - assume an active file exists with
five independent random vars created by compute r1=normal(1).
correlations r1 to r5 / statistics=descriptives.
* This correlation matrix should be nearly an identity matrix, 1's in
the diagonal and 0's elsewhere.
factor variables = r1 to r5 / print = default det
/criteria = factors(5) /save=reg (all,pr).
correlations pr1 to pr5 / statistics=descriptives.
* This correlation matrix should be an identity matrix.
* use series of computes to post-mult data matrix by
a Cholesky matrix calculated outside spss .
compute nr1 = pr1.
compute nr2 = 0.4*pr1 + 0.91652*pr2 .
compute nr3 = 0.3*pr1 + 0.30551*pr2 + 0.9037*pr3.
compute nr4 = 0.2*pr1 + 0.24004*pr2 + 0.29508*pr3 + 0.90294*pr4.
compute nr5 = 0.1*pr1 + 0.17457*pr2 + 0.23976*pr3 + 0.29608*pr4 +
0.90243*pr5.
correlations nr1 to nr5 / statistics=descriptives.
* This correlation matrix should be like the target correlation matrix.
> -----Original Message-----
> From: Frank H. Jurden [SMTP:fhj@QNI.COM]
> Sent: Tuesday, February 02, 1999 1:04 AM
> To: SPSSX-L@UGA.CC.UGA.EDU
> Subject: Re: Correlated sampling?
>
> a good place to start is the lisrel/prelis software manual. it includes a
> data
> generator which can reproduce mulitvariate normal and non-normal data sets
> according to the means/sds/and correlation matrices you desire.
>
> hope this helps.
>
> frank
>
> Merjet wrote:
>
> > I have a question about Monte Carlo simulation, which I haven't found an
> answer
> > to. Suppose one wants to simulate together two random variables, both of
> which
> > are Normal(0,1). If their correlation is -1, 0, or 1, it is trival. But
> how
> > should one proceed if the correlation is not one of these? Linear
> treatment
> > seems plausible, but a co-worker believes that isn't right, but can't
> back it
> > up. In other words, if X1 and X2 are the independent random draws, how
> does
> one
> > get the corresponding correlated random variables?
> >
> > Any help? Any references to book or articles would be much appreciated,
> too.
> >
> > Merjet@aol.com