The SAS/IML language supports both row vectors and column vectors. This is useful for performing linear algebra, but it can cause headaches when you are writing a SAS/IML module. I want my modules to be able to handle both row vectors and column vectors. I don't want the user to have to know that my module requires, say, a row vector, and I don't want my module to fail if it is called with a column vector. This article shows two quick tips for handling row vectors and column vectors in a uniform manner.
Row vectors versus column vectors: Which are which?
The NCOL and NROW functions are useful for determining the dimensions of a vector. If x is a row vector, then nrow(x)=1 and ncol(x) is the number of elements in x. Conversely, if x is a column vector, then nrow(x) is the number of elements in x. This shows why it is unwieldy to write a module that handles both row and column vectors: there is no function that returns the number of elements in a vector independent of the shape of the vector.
Two useful helper functions
Of course, it is trivial to write such a function. The following function returns the number of elements in a vector, regardless of the vector's shape:
start NumElements(x); return( nrow(x)*ncol(x) ); finish; |
In fact, the same function also returns the number of elements in any nxp matrix.
A related useful function is one that returns the last element in a vector:
start Last(x); return( x[nrow(x)*ncol(x)] ); finish; |
I like to use Last(x) to get the last element of a function, rather than the more obscure x[ncol(x)] or x[nrow(x)], which assume a particular shape for the vector.
The ROWVEC and COLVEC modules
Another technique that I use is calling the ROWVEC or COLVEC module to physically copy the argument of a module into a vector that has a known shape. For example, the following module converts the module argument into a row vector, which means that there are ncol(x) elements in x:
start MyModule(_x); x = rowvec(_x); /* x is a row vector. Continue computations. */ finish; |
Why did I create a new variable instead of reusing the argument variable? Because the SAS/IML language uses a pass-by-reference scheme to send variables into functions. This means that any modifications to the variable in the module also affect the variable that is passed to the module. If I were to write _x = rowvec(_x), the code would change the shape of the matrix that is passed to the MyModule function.
9 Comments
Rick, you have started me thinking, is it better to use a row vector or a column vector? Surely it should not matter, so I am a bit alarmed at the results of the code below suggesting that the same multiplication using a row vector might be up to three times faster than using a column vector. Have I missed something?
As far as I know, there is no performance difference between using a row or a column vector. However, multiplication on the left and multiplication on the right access the matrix data in different ways. One way might be experiencing more cache misses than the other.
So it's down to how my computer accesses it's memory? One way it's pulling in values from consecutive memory locations (presumably this is optimized in some way), the other way it's pulling in data from locations that are 300*8 bytes apart. I think this means that internally the matrices are stored in column major order.
SAS/IML stores matrices in row-major order, but it might be that the vendor library is optimized for column major, thus explaining why one computation is faster thant the other. Numerical algorithm are always sensitive to how matrices are stored and how memory is accessed. This is a fundamental aspect of numerical analysis and matrix computations. SAS will, of course, look into this and strive to make each computation as fast as possible.
SAS/IML uses vendor math libraries (when they are available) for matrix multiplication. For example, you might have heard about Intel's math kernel library (MKL). Thanks to your observation, some of my colleagues are investigating the performance of the math libaries for these two cases.
Many thanks for your help Rick. I was not really thinking of matrix multiplication as a numerical algorithm dependent on memory management, as rightly so, IML encourages you to think about it as a basic operator. Therefore running the code above for the first time came as a genuine shock! But if there are three-fold efficiency savings to be had, it certainly pays to know what is going on under the hood.
Agreed. But keep in mind that each computation in your 300x300 example takes about 0.0001 seconds. For most applications, whether the computation takes 0.0001 or 0.0002 seconds is not going to matter. I certainly wouldn't recommend that you change your programming style to use row vectors rather than column vectors, since column vectors are more convenient and the 0.0001 seconds that you might save are not worth the hassle of restructuring code.
By the way, I ran the same program in other (non-SAS) software languages. They exhibit similar behavior.
In R it barely makes a difference using the following code ( a similar thing happens when you specify cv and rv to be matrices):
n <- 300
s <- 20000
a <- runif(n^2)
dim(a) <- c(n,n)
a <- a + t(a)
cv <- runif(n)
t.row <- system.time(
for(i in 1:s) p <- cv %*% a
)
t.col <- system.time(
for(i in 1:s) q <- a %*% cv
)
t.row;t.col;
That's the same test I ran! The results for SAS/IML and R are similar. In R (on my machine) one method takes twice as long as the other:
For comparison the SAS/IML numbers on the same machine are
R stores matrice in column-major form whereas SAS/IML stores matrices in row-major form, so that's why the results are "opposite": In R, the multiplication is slower when the vector is on the left; in SAS/IML, the multiplication is slower when the vector is on the right.