## Avoid loops, avoid the APPLY function, vectorize!

Last week I received a message from SAS Technical Support saying that a customer's IML program was running slowly. Could I look at it to see whether it could be improved?

What I discovered is a good reminder about the importance of vectorizing user-defined modules. The program in this blog post is not the customer's, but it illustrates the main ideas that I'd like to discuss. Consider the following SAS/IML module:

```proc iml; start MinCDF(x, y); a = min(x, y); p = cdf("Normal", a); z = min(p, 0.5); return(z); finish;```

The function takes two scalar arguments, x and y, and does the following:

• It assigns the smaller of x and y to the variable a.
• It calls the CDF function to compute the probability that a normal random variable will be observed to have a value less than or equal to a. Numerically, this requires evaluating the area under the normal density curve from minus infinity to a.
• It returns that probability or 0.5, whichever is smaller.

The details of the function are not important. The important facts are

• It uses the MIN function to return the smaller of two scalar values.
• It performs a computationally expensive operation. In this case, the CDF call.
• It is a scalar function: each argument represents a scalar value and it returns a scalar value.

The function is not vectorized, which means that you cannot pass in two vectors for x and y and get back a vector of results. (Try it!) The problem is the MIN function, which returns the smallest value (a scalar) from among all elements of x and y.

If you want to call the function on many pairs of x and y values, you have two options: call the function in a loop or rewrite it so that it will return a vector of results when passed vectors for input arguments. For example, you might want to evaluate the function on a fine grid of values in order to visualize the function as a surface over the xy-plane, as shown to the left. The following statements evaluate the function in a loop. The loop calls the function more than 360,000 times, and takes about a second:

```xy = ExpandGrid( do(-3,3,0.01), do(-3,3,0.01) ); /* grid of (x,y) values */ x = xy[,1]; y = xy[,2]; z = j(nrow(xy),1); /* allocate vector for results */   t0 = time(); do i = 1 to nrow(xy); z[i] = MinCDF(x[i], y[i]); end; t_loop = time() - t0;```

The customer's program did something equivalent. The customer called the APPLY function to avoid writing a loop. If you have a function that takes scalar-valued arguments and returns a scalar argument, you can use the APPLY function to evaluate the scalar-valued function at many input values, as follows:

`z = apply("MinCDF", x, y);`

The APPLY function requires about the same amount of time to run as the loop. The APPLY function merely prevents you from writing the loop.

For this example, you can call the MinCDF function 360,000 in about one second. The customer's function was more computationally expensive, and the customer was using the APPLY function to call the function millions of times. Consequently, the program was taking a long time to run.

Experienced SAS/IML programmers are familiar with the potential advantages of vectorizing a computation. For this computation, the key to vectorizing the function was to eliminate the call to the MIN function and instead call the elementwise minimum operator (><), which will return the vector of minimums when x and y are vectors. The new vectorized module is below:

```start VecMinCDF(x,y); a = x >< y; /* elementwise minimum */ p = cdf("Normal", a); /* vector of probabilities */ z = p >< 0.5; /* elementwise min of cdf and 0.5 */ return(z); finish;   t0 = time(); z = VecMinCDF(x, y); t_vec = time() - t0;```

Calling the vectorized function on two vectors of size 360,000 requires 0.01 seconds, which is a hundredfold speedup. For the customer's example, the speedup was even more dramatic. Notice in this case that rewriting the function was trivial: only the call to the MIN function was replaced.

Someone asked me the other day if it always possible to vectorize a computation. Unfortunately, the answer is "no." For some iterative algorithms, the second element of a result depends on the value of the first element. Algorithms of that sort cannot be vectorized, although you might be able to vectorize parts of the computation.

This example teaches an important lesson: although the APPLY function makes it easy to call a scalar-valued function on a vector of arguments, the convenience might come at a price. The APPLY function is essentially equivalent to calling a scalar-valued function in a loop, which can lead to poor performance. If you want to call a function with scalar arguments many times with different input values, investigate whether it is possible to rewrite the function so that it is vectorized. The speedup can be dramatic.

Post a Comment

## Plotting multiple time series in SAS/IML (Wide to Long, Part 2)

I recently wrote about how to overlay multiple curves on a single graph by reshaping wide data (with many variables) into long data (with a grouping variable). The implementation used PROC TRANSPOSE, which is a procedure in Base SAS.

When you program in the SAS/IML language, you might encounter data in a matrix with many columns that you need to reshape. The SAS/IML language provides several functions for reshaping data:

In the PROC TRANSPOSE example, the data were reshaped by reading the data set in row-major order: The kth observation for the first variable is followed by the kth observation for the second, third, and fourth variables, and so forth. Consequently, the default behavior of the following WideToLong module is also to interleave the column values. However, the module also enables you to specify that the data should be stacked columnwise. That is, the reshaped data consists of all observations for the first variable, followed by all observations for the second variable, and so forth.

Some of the input variables might be numeric whereas others might be character, so the following module handles the X, Y, and grouping variables in separate matrices. The module returns three matrices: The long form of the Y matrix, the replicated X values, and the replicated grouping (or ID) variable. By convention, output variables are listed first in the argument list, as follows:

```proc iml; start WideToLong(OutY, OutX, OutG, /* output vars */ Y, X=T(1:nrow(Y)), Group=1:ncol(Y), stack=0); /* input vars */ p = ncol(Y); N = nrow(Y); cX = colvec(X); cG = colvec(Group); if stack then do; /* stack Y in column-major order */ OutY = shapecol(Y, 0, 1); /* stack cols of Y */ OutX = repeat(cX, p); /* replicate x */ OutG = colvec(repeat(cG, 1, N)); /* duplicate Group ID */ end; else do; /* DEFAULT: interleave Y in row-major order */ OutY = shape(Y, 0, 1); /* interleave cols of Y */ OutX = colvec(repeat(cX, 1, p)); /* replicate x */ OutG = repeat(cG, N); /* duplicate Group ID */ end; finish;```

Let's see how the module works by testing it on the same data as in the previous blog post. For the Sashelp.Tourism data, each observation is a year in the range 1966–1994. The year is stored in a variable named Year. (Technical note: The Year data is stored as a SAS date value, so the example uses the YEAR function to convert the date value to an integer.) The exchange rates for the British pound and the Spanish peseta are stored in separate variables named ExUK and ExSp, respectively. The following statements read the data into SAS/IML matrices, and then convert the data from wide to long form by calling the WideToLong routine:

```YNames = {ExUK ExSp}; GroupValues = {"UK" "Spain"}; XName = "Year";   use Sashelp.Tourism; read all var YNames into Y; read all var XName into X; close Sashelp.Tourism;   /* For these data, X is a SAS date value. Convert to integer. */ X = year(X); run WideToLong(ExchangeRate, Year, Country, /* output (long) */ Y, X, GroupValues); /* input (Y is wide) */```

As shown in the previous blog post, a useful application of reshaping data is that it is easy to overlay multiple line plots on a single graph. In the SAS/IML language, the SERIES statement creates series plots and supports the GROUP= option for overlaying multiple lines. The following statement creates a time series of the two exchange rates between 1966 and 1994:

```title "Exchange Rate Indices"; call series(Year, ExchangeRate) group=Country;```

By default, the data are reshaped in row-major order. To reshape in column-major order, specify the STACK=1 option, as follows:

```run WideToLong(ExchangeRate, Year, Country, /* output (long) */ Y, X, GroupValues) stack=1; /* input (Y is wide) */```
Post a Comment

## Plotting multiple series: Transforming data from wide to long

Data. To a statistician, data are the observed values. To a SAS programmer, analyzing data requires knowledge of the values and how the data are arranged in a data set.

Sometimes the data are in a "wide form" in which there are many variables. However, to perform a certain analysis requires that you reshape the data into a "long form" where there is one identifying (ID) variable and another variable that encodes the value. I have previously written about how to convert data from wide to long format by using PROC TRANSPOSE.

Recently I wanted to plot several time series in a single graph. Although PROC SGPLOT supports multiple SERIES statements, it is simpler to use the GROUP= option in a single SERIES statement. The GROUP= option makes it convenient to plot arbitrarily many lines on a single graph. This article describes how to reshape the data so that you can easily plot multiple series in a single plot or in a panel of plots.

### Series plots for wide data

The Sashelp sample library includes a data set named Tourism that includes an index of the exchange rates for the British pound and the Spanish peseta versus the US dollar for the years 1966–1994. Each observation is a year. The year is stored in a variable named Year; the exchange rates are stored in separate variables named ExUK and ExSp, for the UK and Spanish exchange rates, respectively. You can use separate SERIES statements to visualize the exchange rates, as follows:

```proc sgplot data=Sashelp.Tourism; series x=Year y=ExUK / legendlabel="UK Pound"; series x=Year y=ExSp / legendlabel="Spanish Pesetas"; yaxis label="Exchange Rate vs Dollar"; run;```

It is not difficult to specify two SERIES statements, but suppose that you want to plot 20 series. Or 100. (Plots with many time series are sometimes called "spaghetti plots.") It is tedious to specify many SERIES statements. Furthermore, if you want to change attributes of every line (thickness, transparency, labels, and so forth), you have to override the default options for each SERIES statement. Clearly, this becomes problematic for many variables. You could write a macro solution, but the next section shows how to reshape the data so that you can easily plot multiple series.

### From wide to long in Base SAS

A simpler way to plot multiple series is to reshape the data. The following call to PROC TRANSPOSE converts the data from wide to long form. It manipulates the data as follows:

1. Rename ExUK and ExSp variables to more descriptive names (UK and Spain). This step is optional.
2. Transpose the data. Each value of the Year variable is replicated into two new rows. The values of the UK and Spain variables are interleaved into a new variable named ExchangeRate.
3. A new column named Country identifies which exchange rate is associated with which of the original variables.
```proc transpose data=Sashelp.Tourism(rename=(ExUK=UK ExSp=Spain)) /* 1 */ out=Long(drop=_LABEL_ rename=(Col1=ExchangeRate)) /* 2 */ name=Country; /* 3 */ by Year; /* original X var */ var UK Spain; /* original Y vars */ run;```

The original data set contained 29 rows, one X variable (Year) with unique values, and two response variables (ExUK and ExSp). The new data set contains 58 rows, one X variable (Year) with duplicated values, one response variable (ExchangeRate), and an discrete variable (Country) that identifies whether each exchange rate is for the British pound or for the Spanish Peseta. For the new data set, you can recreate the previous plot by using a single SERIES statement and specifying the GROUP=COUNTRY option:

```proc sgplot data=Long; label ExchangeRate="Exchange Rate vs Dollar" Country="Country"; series x=Year y=ExchangeRate / group=Country; yaxis label="Exchange Rate"; run;```

The resulting graph is almost identical to the one that was created earlier. The difference is that now it is easier to change the attributes of the lines, such as their thickness or transparency.

### Create a panel of plots

If you want to see each time series in its own plot, you can use the SGPANEL procedure, which also requires that the data be in the wide form:

```proc sgpanel data=Long; label ExchangeRate="Exchange Rate vs Dollar" Country="Country"; panelby Country; series x=Year y=ExchangeRate; run;```

Notice that you do not need to sort the data by the Country variable in order to use the PANELBY statement. However, if you want to conduct a BY-group analysis of these data, then you can easily sort the data by using PROC SORT.

Which do you prefer: multiple SERIES statements or a single SERIES statement that has the GROUP= option? Or maybe you prefer paneled plots? There are advantages to each. Leave a comment.

Post a Comment

## Complete cases: How to perform listwise deletion in SAS

SAS procedures usually handle missing values automatically. Univariate procedures such as PROC MEANS automatically delete missing values when computing basic descriptive statistics. Many multivariate procedures such as PROC REG delete an entire observation if any variable in the analysis has a missing value. This is called listwise deletion or using complete cases. Some procedures such as PROC FREQ and PROC CORR have options that control the way that missing values are handled in the statistical analysis.

Because SAS procedures internally handle which observations are kept or deleted, there is not usually a need for SAS programmers to create a data set that contains all of the complete cases. In contrast, SAS/IML programmers often have to worry about missing values. When implementing a multivariate algorithm, the first step is often to delete rows of the data matrix that contain missing values. This article shows how to perform listwise deletion by using the DATA step and PROC IML.

The case of missing values in numerical data is the most important case, so this article uses the following data set. The first, fourth, and fifth observations represent complete cases.

```data A; input x1-x5; datalines; 1 2 3 4 5 2 2 . 4 5 3 3 3 3 . 4 1 3 2 5 5 2 1 3 4 6 . . . 7 . 8 9 0 . . . . . . 9 8 7 . . ;```

### Output complete cases by using the DATA step

If the variables are all numeric, you can use the NMISS function to determine which observations have missing values. You can use the subsetting IF statement to output only those observations that do not contain missing values, as follows:

```data CompleteCases; set A; if nmiss(of _NUMERIC_)=0; /* output complete cases for all numeric vars */ run;   proc print noobs; run;```

In the previous example, all numeric variables are used to determine the complete cases. You can also specify a list of variables to the NMISS function, as follows:

` if nmiss(x1, x2, x5)=0; /* specify list of numeric vars */`

For character variables, you can use the CMISS function and the _CHARACTER_ keyword. In fact, the CMISS function also counts numeric variables, so it enables you to handle numeric and character variables with a single call. The following DATA step outputs the complete cases for all variables in the Sashelp.Heart data set:

```data CompleteCases; set Sashelp.Heart; if cmiss(of _ALL_)=0; /* complete cases for all vars */ run;```

Again, you can specify a list of variable names to the CMISS function if you need to analyze on certain variables.

### Complete cases in the SAS/IML language

In the SAS/IML language, you can use matrix subscripts to extract specific rows of a matrix. Therefore, the following function returns the indices for rows that do not contain missing values. The COUNTMISS function is used to count the number of missing values in each row. The LOC function returns the index for the rows that do not contain missing values, as follows:

```proc iml; /* return rows that have no missing values */ start CompleteCases(X); return( loc(countmiss(X, "row")=0) ); finish;   use A; read all var _NUM_ into X; close A; nonMissingRows = CompleteCases(X); print nonMissingRows;```

The CompleteCases function makes it trivial to extract rows of a matrix that are complete cases. Because the COUNTMISS function accepts both numerical and character matrices, the CompleteCases function works for all matrices. The following function encapsulates the extraction of complete cases:

```/* exclude any row with a missing value */ start ExtractCompleteCases(X); idx = CompleteCases(X); if ncol(idx)>0 then return( X[idx, ] ); else return( {} ); finish;   Y = ExtractCompleteCases(X); if ncol(Y)>0 then print Y; else print "No complete cases";```

Have you ever needed to physically construct a matrix or data set that has only complete cases? What was the application? Leave a comment.

Post a Comment

## Monitor the progress of a long-running SAS/IML program

When you have a long-running SAS/IML program, it is sometimes useful to be able to monitor the progress of the program. For example, suppose you need to computing statistics for 1,000 different data sets and each computation takes between 5 and 30 seconds. You might want to output a message after every 100th computation so that you know at a glance whether the program is close to finishing. (You might also want to check out some efficiency tips and ways to vectorize your program.)

Now some of you might be thinking, "What's the problem? Just put a PRINT statement inside the loop!" Right idea, but it doesn't work like you might expect. For efficiency, SAS/IML loops are compiled and executed in a very efficient manner. One of the consequences of this efficiency is that ODS output is not flushed until the entire DO loop has completed! Run the following program (which takes 10 seconds) and notice that you get NO output until the loop completes, at which point you will see all 10 messages displayed at once:

```proc iml; do i = 1 to 1000; if mod(i,100)=0 then do; print "Iteration= " i; end; call sleep(1, 0.01); /* delay for 1/100th second */ end; print "Execution finished";```
OUTPUT

So how can you force SAS/IML to display information while the loop is executing? I know two ways. In the IMLPlus language (available in the SAS/IML Studio application), you can use the PRINTNOW statement to flush the ODS output. In PROC IML, you can use the SUBMIT and ENDSUBMIT statements to write messages to the SAS Log.

### The IMLPlus solution: The PRINTNOW statement

In the SAS/IML Studio environment, the IMLPlus language supports the PRINTNOW statement. The PRINTNOW statement tells the IMLPlus program to retrieve and display any pending output from the SAS Workspace server. In the IMLPlus language, the following statement prints the iteration history while the DO loop is executing:

```do i = 1 to 1000; if mod(i,100)=0 then do; print "Iteration= " i; printnow; /* IMLPlus statement: works only in SAS/IML Studio */ end; call sleep(1, 0.01); /* delay for 1/100th second */ end;```

### The PROC IML solution: The SUBMIT/ENDSUBMIT statements

If you are running PROC IML as part of a larger SAS program, you can use the SUBMIT/ENDSUBMIT statements to write messages to the SAS Log. Recall that you can pass values from SAS/IML matrices into SAS statements. If you include the name of a SAS/IML variable on the SUBMIT statement, the contents of that variable are substituted into the SAS code before the SUBMIT block is sent to SAS. In the following example, the value of the counter i is sent into a SUBMIT block, and the %PUT macro statement is used to print the iteration value to the SAS Log:

```proc iml; do i = 1 to 1000; if mod(i,100)=0 then do; submit i; %put Iteration= &i; endsubmit; end; call sleep(1, 0.01); /* delay for 1/100th second */ end; print "Execution finished";```

The previous program displays the iteration count in the SAS Log while the program is running.

Do you have long-running programs in SAS? Have you developed a clever way to monitor their progress while they run? Leave a comment about your technique.

Post a Comment

## Friends don't let friends concatenate results inside a loop

Friends have to look out for each other.

Sometimes this can be slightly embarrassing. At lunch you might need to tell a friend that he has some tomato sauce on his chin. Or that she has a little spinach stuck between her teeth. Or you might need to tell your friend discreetly that he should examine his zipper before he presents at the staff meeting.

Yes, it can be awkward to speak up. However, it would be more embarrassing to my friends if I don't tell them these things. What kind of a friend would I be if I allow them to walk around all day with food on their face or an unzipped fly?

I consider many of my readers to be friends, so let me whisper a little secret in your ear: "When you program in a matrix-vector language such as SAS/IML, do not concatenate results inside a loop."

It is the programming equivalent of having spinach in your teeth. People will see it and think "how could he have missed such an obvious thing?" It makes you look bad. Some will snicker.

Concatenation inside a loop is a "rookie mistake" that experienced statistical programmers avoid. The better way to accumulate a vector or matrix of results is to allocate an array for the results prior to the loop and fill the array inside the loop.

The following example illustrates why it is more efficient to use the "allocate and fill" method instead of dynamically growing an array by using concatenation. The programmer wants to call the ComputeIt function 10,000 times inside a loop. The function returns a row vector with 100 elements. The first group of statements times how long it takes to dynamically grow the matrix of results. The second group of statements times how long it takes to allocate the matrix outside the loop and fill each row inside the loop:

```proc iml; NumIters = 10000; start ComputeIt(n); return( n:(n+99) ); finish;   /* time how long it takes to dynamically grow an array */ t0 = time(); free result; do i = 1 to NumIters; result = result // ComputeIt(i); /* inefficient */ end; tConcat = time() - t0;   /* same computation, but allocate array and fill each row */ t0 = time(); result = j(NumIters, 100); do i = 1 to nrow(result); result[i,] = ComputeIt(i); /* more efficient */ end; tAlloc = time() - t0;   print tConcat tAlloc;```

For this example, forming the result matrix by using concatenation takes almost 3 seconds. Filling rows of a pre-allocated array takes about 0.02 seconds for the same computation. The computations and the results are identical, but one method is one hundred times faster.

The results are not always this dramatic. If the ComputeIt function returns a tiny vector (such as return(2#n);), then the difference in run time is not as extreme. Similarly, if the ComputeIt function takes a long time to run, then it won't matter much whether you concatenate or pre-allocate the result. If each iteration of the loop takes 1 second, then the difference between 10,003 seconds and 10,000 seconds is inconsequential.

Nevertheless, I recommend that you adopt the "allocate and fill" technique as a good programming habit. Good habits pay off and make you look professional. And if someone points to your program and asks, "Why are you doing it this way?", you can pull them aside and whisper, "Friend, let me tell you a secret...."

Post a Comment

## Binary heart in SAS

The xkcd comic often makes me think and laugh. The comic features physics, math, and statistics among its topics. Many years ago, the comic showed a "binary heart": a grid of binary (0/1) numbers with the certain numbers colored red so that they formed a heart.

Some years later, I used an equation in polar coordinates to draw a heart in SAS as a Valentine's Day gift to my wife. This year I thought it would be fun to combine these two ideas into a geeky Valentine's Day post.

The following SAS DATA step generates on a uniform grid of points near the origin. You can create an indicator variable that specifies which elements of the grid are inside the heart equation. By using the RAND function, you can generate a random 0 or 1 according to the Bernoulli distribution. You can then call the SGPLOT procedure to show the binary values, using red to display those that are inside the heart region and light gray to display the values outside the heart. The resulting image is shown at the top of this article.

```/* generate random binary heart similar to xkcd comic: http://xkcd.com/99/ */ data BinaryHeart; drop Nx Ny t r; Nx = 21; Ny = 23; call streaminit(2142015); do x = -2.6 to 2.6 by 5.2/(Nx-1); do y = -4.4 to 1.5 by 6/(Ny-1); /* convert (x,y) to polar coordinates (r, t) */ r = sqrt( x**2 + y**2 ); t = atan2(y,x); /* eqn: blogs.sas.com/content/iml/2011/02/14/a-parametric-view-of-love/ */ Heart= (r < 2 - 2*sin(t) + sin(t)*sqrt(abs(cos(t))) / (sin(t)+1.4)) & (y > -3.5); B = rand("Bernoulli", 0.5); output; end; end; run;   ods graphics / width=400px height=500px; title "Happy Valentine's Day"; proc sgplot data=BinaryHeart noautolegend; styleattrs datacontrastcolors=(lightgray red); scatter x=x y=y / group=Heart markerchar=B markercharattrs=(size=14); xaxis display=none offsetmin=0 offsetmax=0.06; yaxis display=none; run;```

The image is similar to the original hand-drawn xkcd comic, although the random binary values are different. I don't know what a "binary heart" means, but I find it interesting nonetheless.

Notice the use of the STYLEATTRS statement in PROC SGPLOT to set the colors of the marker characters. The STYLEATTRS statement was added in SAS 9.4. In SAS 9.3 you can use the DATTRMAP= option in the PROC SGPLOT statement to specify the colors of groups in SAS statistical graphics.

Post a Comment

## Creating an array of matrices

The SAS DATA step supports multidimensional arrays. However, matrices in SAS/IML are like mathematical matrices: they are always two dimensional. In simulation studies you might need to generate and store thousands of matrices for a later statistical analysis of their properties. How can you accomplish that unless you can create an array of matrices?

A simple solution is to "flatten" each matrix into a row vector and form a big array where the ith row of the array contains the elements of the ith matrix. This storage scheme will be familiar to programmers who have used SAS/IML functions for generating random matrices, such as the RANDWISHART function or my program for generating random symmetric matrices.

### Matrices packed into an array

For example, the following example generates 10 random 2 x 2 matrices from a Wishart distribution with 7 degrees of freedom.

```proc iml; call randseed(1234); DF = 7; S = {1 1, 1 5}; X = RandWishart(1000, DF, S); /* generate 1,000 2x2 matrices */ print (X[1:10,])[label="X" colname={X11 X12 X21 X22}];```

Each row of the X matrix contains the elements of a 2 x 2 matrix in row-major order. That is, the first element each the row is the (1,1)th element of the matrix, the second is the (1,2)th element, the third is the (2,1)th element, and the fourth is the (2,2)th element.

### Extracting matrices from an array

You can use the subscript operator to extract a row from the X array. For example, X[1,] contains the values of the first 2 x 2 matrix. You can use the SHAPE function to reshape each row. This might be necessary if you want to multiply with the matrices or compute their determinant or eigenvalues. For example, the following loop iterates over the rows of the array and computes the eigenvalues of each 2 x 2 matrix. You can use the HISTOGRAM subroutine to plot the distribution of the eigenvalues:

```/* compute and store eigenvalues of the 2x2 matrices */ v = j(nrow(X), sqrt(ncol(X)), .); /* each pxp matrix has p eigenvalues */ do i = 1 to nrow(X); A = shape(X[i,], 2); /* A is symmetric 2x2 matrix */ v[i,] = T(eigval(A)); /* find eigenvalues; transpose to row vector */ end;   title "Distribution of Eigenvalues for 1,000 Wishart Matrices"; call histogram(v) rebin={2.5,5};```

The histogram shows a mixture of two distributions. The larger eigenvalue has a mean of 37. The smaller eigenvalue has a mean of 4.5.

### Packing matrices into an array

If you have many matrices that are all the same dimension, then it is easy to pack them into an array. Often your program will have a loop that creates the matrices. Within that same loop you can flatten the matrices and pack them into an array. In the following statements, the ROWVEC function is used to convert a square covariance matrix into a row vector.

```Z = j(1000, 4, .); /* allocate array for 1,000 2x2 matrices */ do i = 1 to nrow(Z); Y = randnormal(8, {0 0}, S); /* simulate 8 obs from MVN(0,S) */ A = cov(Y); /* A is 2x2 covariance matrix */ Z[i,] = rowvec(A); /* flatten and store A in i_th row */ end;```

The examples in this article have shown how to create a one-dimensional array of matrices. The trick is to flatten the ith matrix and store it in the ith row of a large array. In a similar way, you can build two- or three-dimensional arrays of matrices. Keeping track of which row corresponds to which matrix can be tricky, but the NDX2SUB and SUB2NDX functions can help.

For an alternative technique for working with multidimensional arrays in SAS/IML, see the excellent paper by Michael Friendly, "Multidimensional arrays in SAS/IML Software."

Post a Comment

## Specify the order of variables at run time in SAS

In SAS, the order of variables in a data set is usually unimportant. However, occasionally SAS programmers need to reorder the variables in order to make a special graph or to simplify a computation.

Reordering variables in the DATA step is slightly tricky. There are Knowledge Base articles about how to do it. There are papers written about how to do it. There is a page at sasCommunity.org dedicated to how to do it. Reordering variables when the order is not known until run time is even trickier. A typical solution is to construct a macro variable that contains the desired order and then use the LENGTH or RETAIN statement in a separate DATA step to set the order.

In contrast, when you read a SAS data set into a SAS/IML matrix, you can easily specify the order of the variables.

Here's a simple example. The following SAS/IML statements read numerical data from the Sashelp.Class data set into matrices. The columns of matrix X correspond to the variables Age, Height, and Weight (in that order), whereas the columns of matrix Y correspond to the variables Weight, Height, and Age:

```proc iml; use Sashelp.Class; read all var {Age Height Weight} into X; read all var {Weight Height Age} into Y; close;```

If there are k numerical variables in a data set, there are k! different matrices that you can create. Each matrix has the same elements, but the columns are permuted. (Programming Challenge: Write a SAS/IML program that creates all k! matrices.)

In the previous example, the order of the variables were hard-coded by the programmer. But it is just as easy to set the order of the variables at run time. And no macro variable is required! You merely create a character vector that contains the variables names in the order that you want.

Here's an example that is similar to a recent question that was posted on the SAS Support Communities. The data set named VECTOR contains observations in that are not in alphabetical order. (Perhaps the data set was sorted by the X variable.) Another data set named MATRIX contains a matrix where the variables are in alphabetical order.

```data vector; input ID \$ x; cards; D 1 C 2 B 3 E 4 A 5 ;   data matrix; input A B C D E; datalines; 1.0 0.8 0.6 0.4 0.2 0.8 1.0 0.8 0.6 0.4 0.6 0.8 1.0 0.8 0.6 0.4 0.6 0.8 1.0 0.8 0.2 0.4 0.6 0.8 1.0 ;```

The programmer who posed this question wanted to ensure that the observations in the X variable and the variables in the matrix were in the same order so that he could correctly multiply the matrix and vector.

There are two ways to solve the problem: either reorder the observations of the vector or reorder the columns of the matrix. The latter option is the easiest. When you read the VECTOR data set, read the ID variable that specifies the ordering for the vector. Then use that character vector to specify the order of the columns of the matrix, as follows:

```proc iml; use vector; read all var {ID x}; /* ID contains the order */ close vector;   use matrix; read all var ID into M; /* read variables in the same order */ close matrix;   y = M*x; /* do the matrix multiplication */ print y[rowname=ID];```

The vector y contains the product of M*x. The observations in y are in the same order as the observations in x.

The key takeaway is that you can easily read the variables in a data set into a SAS/IML matrix in any order. In particular, you can specify the order at run time based on the values of some other data set or computation.

Post a Comment

## Detect empty parameters that are passed to a SAS/IML module

A SAS/IML programmer asked a question on a discussion forum, which I paraphrase below:

I've written a SAS/IML function that takes several arguments. Some of the arguments have default values. When the module is called, I want to compute some quantity, but I only want to compute it for the parameters that are supplied and are nonempty.

To illustrate the programmer's dilemma, consider the following SAS/IML function:

```proc iml; start Func1(a, b=2, c=); sum = a + b + c; /* WRONG */ return( sum ); finish;```

This function uses optional parameters. The function is supposed to return the sum of three scalar arguments, but there is a problem: the addition operator will not add an empty matrix. (Recall that an empty matrix has no rows or columns.) If the second parameter is supplied but is empty, then the addition operation will fail. Similarly, if the third parameter is skipped, then the default value of c is an empty matrix, and the addition operation also fails. The programmer was asking how to compute the sum over only those parameters that are supplied and are nonempty.

You can detect which parameters are supplied by calling the ISSKIPPED function from within the module. The ISSKIPPED function indicates whether an argument was provided to the module or whether it was not provided (was skipped). The argument to the ISSKIPPED function must be one of the argument names.

The ISSKIPPED function will return 0 (FALSE) if the parameter was passed in, even if the value of the parameter is an empty matrix. However, you can use the ISEMPTY function to detect an empty matrix.

Consequently, the following function computes the sum of the parameters that are supplied and are nonempty:

```start Func(a, b=2, c=); /* compute sum of parameters that are supplied */ sum = 0; if ^isskipped(a) & ^isempty(a) then sum = sum + a; if ^isskipped(b) & ^isempty(b) then sum = sum + b; if ^isskipped(c) & ^isempty(c) then sum = sum + c; return( sum ); finish;   s = Func(1, NULL); /* NULL is an undefined (empty) matrix */ print s;```

The second parameter is not included in the sum because it is an empty matrix. The third parameter is not included in the sum because it was not supplied.

This example is somewhat contrived, but it does show how to use the ISSKIPPED and ISEMPTY functions to detect two properties of arguments to a module.

Post a Comment
• ### About this blog

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, statistical graphics, statistical simulation, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

Follow @RickWicklin on Twitter.

Do you have a SAS programming question? Assistance is available! Ask SAS/IML questions at the SAS/IML Support Community. For other SAS issues, visit the SAS Support Communities.

• ### Subscribe to this blog

Enter your email address:

Other subscription options