Today’s post is the second half of yesterday’s and carries on where that left off. The combined version turned out to be too long for a single post so I decided at the last minute to split it in two.
Normal scores
Next we need to generate the ‘dummy’ sample. This is an \( n \times m \) matrix each row of which represents one realisation of the random vector \( X \), which has \( m \) components, \( X_{1}, X_{2}, \ldots X_{m}\), each of which is standard normal. Because our objective is to establish a set of rank orders, which we will then transfer to our actual random data sets, \( X\) does not actually need to be random. Iman and Conover proposed using what they called ‘normal scores’. This is a sequence of numbers, \( \left\{ x_{i} \right\}\), generated as follows:
\[ \begin{equation} x_{i}=\Phi^{-1}\left(\frac{i}{n+1}\right) \label{eqn:37} \end{equation} \]
where \( \Phi\) is shorthand for the cumulative distribution function of the standard normal random variable. This sequence is simply the application of the inverse transform to a set of equally spaced points in \( (0, 1)\). Though not random it is normally distributed. Figure 8 below shows what the sequence looks like on a number line (the y-axis) and how it relates the the argument of the function, which is shown on the x-axis. In the figure \( n = 51\) so the points on the x-axis are the sequence of points \( \left\{ \frac{1}{52},\,\frac{2}{52},\,\ldots,\,\frac{51}{52}\right\}\).
If a large enough sequence of normal scores is plotted on a histogram with bins wide enough to contain a decent number of points, then it will look exactly like a standard normal p.d.f. This seems intuitive from looking at Figure 8. The sequence is symmetrical about zero, that is
\[ \begin{equation} \Phi^{-1}\left(\frac{i}{n+1}\right)=-\Phi^{-1}\left(1 – \frac{i}{n+1}\right) \label{eqn:38} \end{equation} \]
which means that we only need to generate the first \( n/2 \) values and copy them, with their signs reversed, to the remainder of the range.
Normalisation
Despite being distributed according to the standard normal, the sequence of normal scores does not have unit variance, though it tends to it as the number of points increases. It therefore needs to be normalised to make it do so. To effect the normalisation, we replace \( x\) by \( x/k\) where
\[ \begin{equation} k=\sqrt{\frac{2}{n}\sum_{i=1}^{\mathrm{floor}\left(n/2\right)}\left(\Phi^{-1}\left(\frac{i}{n+1}\right)\right)^{2}} \label{eqn:39} \end{equation} \]
\( k\) is computationally expensive to calculate, but this doesn’t matter because it only needs to be calculated once.
Columns 2 to n
Again because we are only using these data to get rank orderings, it doesn’t matter if all the columns of \( X\) are the same, so we can simply copy the first column to the others.
Shuffling
We now have a sample matrix of normal scores with unit variance, but they are all ordered, and so perfectly correlated with each other. We want the opposite, i.e. for them to be completely uncorrelated. To achieve this, they must be randomly shuffled. We do this using the following algorithm.
- Step 1:
- start at the end of the array
- Step 2:
- choose a position randomly from earlier in the array
- Step 3:
- swap the two values
- Step 4:
- move to the next position down
Repeat steps 2 to 4 the above process until you reach the beginning of the array.
The code that implements this is as follows (leaving out declarations, tests and other boring stuff like that, that would obscure what the function actually does):
Public Function shuffle(vArray, column) startIndex = LBound(vArray, 1) endIndex = UBound(vArray, 1) j = endIndex Do While j > startIndex k = CLng(startIndex + Rnd() * (j - startIndex)) temp = vArray(j, column) vArray(j, column) = vArray(k, column) vArray(k, column) = temp j = j - 1 Loop End Function
This function shuffles a specified column of a two dimensional array. It must be applied once for each column to be shuffled.
The columns will still be uncorrelated if one of them is left unshuffled, and it is usual practice to do this to save a bit of computational effort. If we only have one set of input variables with one big correlation matrix then this is fine. However, if we have two or more subsets that each contain variables correlated only with members of that subset, and if each subset has an unshuffled column, then the two unshuffled columns will be perfectly correlated with each other even though they are supposed to be completely uncorrelated. To avoid this, in this project we shuffle all of the columns.
Inducing correlation
The next step is to make the array of ‘normal scores’ correlated. We have already discussed this in detail. It is done by right-multiplying \( X\) by the matrix \( Z\) as described above.
Calculating rank orders
Having got our matrix of randomised but correlated normal scores, the next step is to calculate its rank orders. Excel has a worksheet function that will do this to arrays of cells in a worksheet, but not one that will do it to a VBA array in memory. We therefore need to write one. To do this, it is necessary to sort each of its columns in parallel with a second array that consists of an ordered sequence of integers from \( 1\) to \( n\). In what follows the column of numbers whose rank orders we wish to find is called the ‘key array’ and the column of integers sorted in parallel with it is called the ‘index array’.
Before the sort, the ith position in the key array contains its ith value (in the original order). The ith position in the index array contains the integer i. After the sort, the ith position in the key array contains its ith largest value, and the ith position in the index array contains an integer denoting the location before the sort of the ith largest value in the key array. Table 3 below summarises this.
Key array | Index array | |
Before sort | ith value in original order | The integer i |
After sort | ith largest value | Integer representing the position, in the original order, of the ith largest value in the key array |
Table 3: Contents of ith location in the two arrays before and after sorting
If we define a new function Index(.)
to return the value of the index array after the sort, then Index(i)
contains the location in the key array, when it was in its original order, of the ith largest of its values. If Index(i) = k
, then the rank order of the kth member of the key array, when it was in its original order, is i. If we declare a new array Rank(.)
and then populate it by assigning:
\[ \begin{equation} Rank(Index(i)) = i \label{eqn:40} \end{equation} \]
then it will contain our desired rank orders. That is Rank(j)
will be the rank order of the j
th member of the unsorted array. In other words, the j
th member of the array is the Rank(j)
th largest value in the array.
VBA doesn’t have a built in function for sorting arrays, either. The only way to do it natively would be to paste the array into a worksheet, sort the worksheet and then read the contents of the sorted worksheet back into the VBA array. This would be messy, inelegant and probably slower. Another option would be to link our VBA code to an external library that contained an implementation of an appropriate sorting algorithm. For the reasons discussed in Post 1, I have decided not to do this.
It is therefore necessary to write our own sorting function. For this purpose, I have chosen to use the ‘Quicksort’ algorithm. This works by dividing the array into two subsets and swapping members between them until all the members of one are greater than all the members of the other. It is then applied recursively to each subset until the entire array is sorted (see https://en.wikipedia.org/wiki/Quicksort). The implementation I use here is based on one that appeared in Stack Overflow (see http://stackoverflow.com/questions/152319/vba-array-sort-function).
We wish to sort one column of a multi-column array at a time, leaving the others in their original order, and to apply the same permutations to the same column of a second array, which is of the same size and dimensions as the first array and which initially contains integers in increasing order. Because the function is applied recursively to subsets of the original array, it has to take the limits of the subset as arguments. These are initially the upper and lower bounds of the array. To avoid having to enter these explicitly, the recursive function is called from within a ‘wrapper’ function that takes as arguments only the names of the arrays to be sorted.
Our sort function is not just used in ranking columns of arrays, but is required in other parts of the module too. It may be called upon to operate in four distinct circumstances. These are the combinations of:
- the array to be sorted might be one or two dimensional
- the key array might be passed with or without an associated index array
The code therefore needs to be able to handle all four of these circumstances. Unfortunately, this requires four different versions of the sorting function, though we can get away with having only one ‘wrapper’ function, which can test for which of the above four circumstances is being passed and call the appropriate sorting function. For illustration, the sorting function that handles two dimensional arrays with an index array is as follows:
Private Function qs2dd(inLow, inHi, keyArray, column, otherArray) ' Usage: qs2dd(inLow, inHi, keyArray, column, otherArray) ' ' "qs1dd" = quick sort two dimensional double array ' ' Sorts "keyArray" and "otherArray" in place between indices inLow and inHi. ' ' An array, "otherArray" is sorted in parallel, also in-place. "otherArray" must ' be one dimensional and have the same start and end indices as the first dimen- ' sion of "keyArray". ' ' "keyArray" must be of type Double, "column" and "otherArray" must be of type ' long. ' ' This function is based on code suggested in: ' http://stackoverflow.com/questions/152319/vba-array-sort-function Dim pivot As Double, tmpLow As Long, tmpHi As Long, keyTmpSwap As Double Dim otherTmpSwap As Long, keyDims As Long, otherDims As Long tmpLow = inLow tmpHi = inHi pivot = keyArray((inLow + inHi) \ 2, column) Do While (tmpLow <= tmpHi) Do While keyArray(tmpLow, column) < pivot And tmpLow < inHi tmpLow = tmpLow + 1 Loop Do While keyArray(tmpHi, column) > pivot And tmpHi > inLow tmpHi = tmpHi - 1 Loop If (tmpLow <= tmpHi) Then keyTmpSwap = keyArray(tmpLow, column) otherTmpSwap = otherArray(tmpLow) keyArray(tmpLow, column) = keyArray(tmpHi, column) otherArray(tmpLow) = otherArray(tmpHi) keyArray(tmpHi, column) = keyTmpSwap otherArray(tmpHi) = otherTmpSwap tmpLow = tmpLow + 1 tmpHi = tmpHi - 1 End If Loop If (inLow < tmpHi) Then qs2dd inLow, tmpHi, keyArray, column, otherArray If (tmpLow < inHi) Then qs2dd tmpLow, inHi, keyArray, column, otherArray End Function
The functions that work the other three sets of conditions are simple variations of this and it would not add anything to list them.
The above function is never called directly, except from within its ‘wrapper’ function. This is as follows:
Public Function quicksort(keyArray() As Double, Optional column As Long, Optional otherArray) ' Usage: quicksort(keyArray, column, otherArray) ' ' Sorts keyArray in place. If keyArray is two-dimensional (i.e. a matrix) then ' only the column specified by the optional argument "column" will be sorted. ' ' An optional "otherArray" can be sorted in parallel, also in-place. If ' "otherArray" is used it must be one-dimensional and have the same start and ' end indices as the columns of "keyArray". ' ' "keyArray" must be of type Double, "column" and otherArray must be of type ' long ' DECLARATIONS ' Blah, blah, blah keyDims = NumberOfArrayDimensions(keyArray) ' TESTS ' Rhubarb, rhubarb ' Calculate "inHi" and "inLow" If keyDims = 1 Then inLow = LBound(keyArray) inHi = UBound(keyArray) ElseIf keyDims = 2 Then inLow = LBound(keyArray, 1) inHi = UBound(keyArray, 1) End If ' Call appropriate sort function If keyDims = 1 And IsMissing(otherArray) Then Call qs1ds(inLow, inHi, keyArray) ElseIf keyDims = 1 And Not IsMissing(otherArray) Then Call qs1dd(inLow, inHi, keyArray, otherArray) ElseIf keyDims = 2 And IsMissing(otherArray) Then Call qs2ds(inLow, inHi, keyArray, column) ElseIf keyDims = 2 And Not IsMissing(otherArray) Then Call qs2dd(inLow, inHi, keyArray, column, otherArray) End If End Function
The function NumberOfArrayDimensions()
does what its name implies and finds out whether the array has one, two or more dimensions. It is borrowed from http://www.cpearson.com/excel/VBAArrays.htm, which is gratefully acknowledged and where a listing of it can be found. It works by repeatedly calling the Ubound(X, n)
function where X
is the array and n
is an integer, with increasing values of n
until an error is returned.
Re-ordering the marginals
To do this, we need the marginal sample to be in increasing order. As has been mentioned several times above, latin hypercube using the inverse transform method produces a sample that is natively in increasing order. If we were to use any other method we would have to sort the sample, which would add computational expense. Let X
be the ‘dummy’ array of randomised & correlated ‘normal scores’, and let Y
be the marginal sample array. Assuming Y
is in increasing order, we copy it to another array as follows:
\[ \begin{equation} Y'(i) = Y(Rank(i)) \label{eqn:41} \end{equation} \]
For example, if the 17th element of X
, X(17)
, contains the 147th largest member of X
, then we want Y'(17)
to contain the 147th largest member of Y
. Because Y
is in increasing order, the 147th largest member of Y
is Y(147)
, we can therefore do this by assigning Y'(17) = Y(147)
. The array Rank(i)
contains the rank orders of Y
and so Rank(17) = 147
and so on. So, generally, this would be Y'(i) = Y(Rank(i))
. We now have our sample matrix correlated as desired.
The code that does this is as follows:
Public Function arrayrank(vArray() As Double) As Long() ' Usage: Y = arrayrank(vArray) ' ' Returns an array of longs representing the the rank orders of the elements ' of vArray. If vArray is two-dimensional, then it returns an array with ' the same number of rows and columns with the columns containing the ranks ' of the corresponding columns of "vArray". ' DECLARATIONS ' Blah, blah, blah ' TESTS & CHECKS ' Rhubarb, rhubarb ' Check that dimensionality of vArray is either 1 or 2 vDims = NumberOfArrayDimensions(vArray) ' Create array of consecutive longs from startIndex to endIndex If vDims = 1 Then ReDim vRank(1 To n) ReDim tmpRank(1 To n) ElseIf vDims = 2 Then ReDim vRank(1 To n, 1 To m) ReDim tmpRank(1 To n) End If ' Pass this array to QuickSort along with the array to be ranked If vDims = 1 Then For i = 1 To n tmpRank(i) = i Next i Call quicksort(keyArray:=vArray, otherArray:=tmpRank) For i = 1 To n vRank(tmpRank(i)) = i Next i ElseIf vDims = 2 Then For j = 1 To m For i = 1 To n tmpRank(i) = i Next i Call quicksort(keyArray:=vArray, column:=j, otherArray:=tmpRank) For i = 1 To n vRank(tmpRank(i), j) = i Next i Next j End If arrayrank = vRank End Function
Finally we need to write a function to perform the overall Iman-Conover technique on the sample. This is simply a sequence of calls to the functions we have already written. You may also notice that it contains calls to functions called matmult(A, B)
and mattransmult(A, B)
. These perform the matrix multiplications \( AB\) and \( A^{\mathsf{T}}B\) respectively. I haven’t listed these because they are fairly obvious.
Public Function ic(Xascending() As Double, C() As Double) As Double() ' Usage: y = ic(Xascending, C) ' ' Performs the Iman-Conover method on Xind and returns a matrix Xcorr with the same ' dimensions as Xind. ' ' Xascending is an n-instance sample from an m-element random row-vector, with ' each column sorted in ascending order. ' ' If the columns of Xascending are in ascending order then the correlation ' matrix of Xcorr will be approximately equal to C. This function does not test ' Xascending to check that its columns are in ascending order. To do so would ' incur too large a computational burden. ' ' C must be square, symmetric and positive definite. ' ' The number of rows and columns of C must equal the number of columns Xascending. ' DECLARATIONS GO HERE ' Blah, blah, blah nRowsX = UBound(Xascending, 1) - LBound(Xascending, 1) + 1 nColsX = UBound(Xascending, 2) - LBound(Xascending, 2) + 1 ' MORE DECLARATIONS GO HERE ' Blah, blah, blah ' TESTS GO HERE ' Rhubarb, rhubarb ' END OF TESTS: Actual maths starts here! ' Calculate the upper triangular Cholesky root of C S = chol(C) ' Calculate the matrix, MX, of "normal scores" MX = normalscores(nRowsX, nColsX) ' Calculate the matrix EX = MX' * MX EX = mattransmult(MX, MX) ' Calculate Fx, the Cholesky root of Ex. FX = chol(EX) ' Calculate ZX = FX^{-1) * S ZX = finvs(FX, S) ' Calculate TX, the reordered matrix of "scores". TX = matmult(MX, ZX) ' Calculate the rank orders of TX. ranks = arrayrank(TX) ' Reorder columns of X to match T. For j = 1 To nColsX For k = 1 To nRowsX YX(k, j) = Xascending(ranks(k, j), j) Next k Next j ic = YX End Function
Next we need to use the algorithms we have developed above to produce a practical working application. This is the subject of my next post.
I’m having a day off tomorrow, so that will probably be on Monday.
© 2015 Howard J. Rudd all rights reserved.
Wonderful Website.