For each observation, find the variable that contains the minumum value

The other day I encountered an article in the SAS Knowledge Base that shows how to write a macro that "returns the variable name that contains the maximum or minimum value across an observation." Some people might say that the macro is "clever." I say it is complicated. This is a simple problem; it deserves a simpler solution.

This is one of those situations where a SAS/IML implementation is simpler and cleaner than a macro/DATA step solution. The following DATA step creates the data that are used in the SAS Knowledge Base article:

/* Data for Sample 46471: Return the variable name that contains 
              the max or min value across an observation */
data one;
input a b c d e;
cards;
1    3  12   6 15
34 583 294 493  2
;
run;

By inspection, the minimum value in the first row is 1, which occurs for the variable A. In the second row, the minimum value is 2, which occurs for the variable E.

To find the variable for each row that contains the minimum value for that row, you can use the index minimum subscript reduction operator, which has the symbol >:<. The subscript reduction operators are a little-known part of the SAS/IML language, but they can be very useful. The following SAS/IML program begins by reading all numerical variables into a matrix, X. The subscript reduction operator then computes a column vector whose ith element is the column for which the ith row of X is minimal. You can use this column vector as an index into the names of the columns of X.

/* For each row, find the variable name corresponding to the minimum value */
proc iml;
use one;
read all var _NUM_ into X[colname=VarNames];
close one;
 
idxMin = X[, >:<];         /* find columns for min of each row */
varMin = varNames[idxMin]; /* corresponding var names */
print idxMin varMin;

Yes, those two statements compute the same quantity as the complicated macro. And, if you are willing to nest the statements, you can combine them into a single statement:
varNames[X[, >:<]].

Finding the maximum value for each row is no more difficult: simply use the <:> subscript reduction operator.

The macro featured in the Knowledge Base article includes an option to compute with specified variables, rather than all numerical variables. That, too, is easily accomplished. For example, the following statements find the variable that has the largest value among the A, B, and C variables:

varNames = {a b c};
use one;
read all var varNames into X;
close one;
 
idxMax = X[,<:>];
varMax = varNames[idxMax];
print idxMax varMax;

My next post will discuss subscript reduction operators in further details.

To my readers who are SQL experts: Is there a simple way to solve this problem by using PROC SQL? Leave a comment.

Post a Comment

The curious case of random eigenvalues

I've been a fan of statistical simulation and other kinds of computer experimentation for many years. For me, simulation is a good way to understand how the world of statistics works, and to formulate and test conjectures. Last week, while investigating the efficiency of the power method for finding dominant eigenvalues, I generated symmetric matrices to use as test cases. Each element of the matrix was drawn randomly from U[0,1], the uniform distribution on the unit interval. As I checked my work for correctness, I noticed a few curious characteristics about the eigenvalues of these matrices:

  • They have one positive dominant eigenvalue whose value is approximately n/2, where n is the size of the matrix.
  • The other eigenvalues are distributed in a neighborhood of zero.

I noticed that the same pattern held for all of my examples, so I turned to simulation to try to understand what was going on.

The distribution of eigenvalues in random matrices

I had previously explored the distribution of eigenvalues for random orthogonal matrices and shown that they have complex eigenvalues that are (asymptotically) distributed uniformly on the unit circle in the complex plane. I suspected that a similar mathematical truth was behind the pattern I was seeing for random symmetric matrices.

The key statistical observation is that if the elements of a matrix are random variables, then the eigenvalues (which are the roots of a polynomial of those elements) are also random variables. As such, they have an expected value and a distribution. My conjecture was that the expected value of the largest eigenvalue was n/2 and that the expected values of the smaller eigenvalues are clustered near zero.

In a series of simulations, I generated a large number of random symmetric nxn matrices with entries drawn from U[0,1]. For each matrix I computed the n eigenvalues and stored them in a matrix with n columns. Each row of this matrix contains the eigenvalues for one random symmetric matrix. The ith column contains simulated values for the position of the ith eigenvalue. (Recall that the EIGEN routine in SAS/IML software returns the eigenvalues in (descending) sorted order.) The following graph shows the distribution of eigenvalues for 5,000 random symmetric 10x10 matrices:

Each color represents an eigenvalue, and the histogram of that color shows the distribution of the eigenvalue for 5,000 random 10x10 matrices. Notice that the largest eigenvalue is always close to 5. The largest eigenvalue has a distribution that looks like it might be approximately normally distributed. The distributions for the smaller eigenvalues overlap. A typical 10x10 matrix has 4 smaller positive eigenvalues, 4 smaller negative eigenvalues, and one eigenvalue whose expected value seems to be close to zero.

You can convince yourself that this result is reasonable by considering the constant matrix, C, for which every element is identically 0.5. (I got this idea from a paper that I will discuss later in this article.) This matrix C is singular with n-1 zero eigenvalues. Because the sum of the eigenvalues is the trace of the matrix, C has a single positive eigenvalue with value n/2. The random matrices that we are considering have an expected value of 0.5 in each element, so it makes sense that the eigenvalues will be close to the eigenvalues of C.

What happens for large matrices?

I did a few more simulations with matrices of different sizes. The patterns were similar. I suspected that there was an underlying mathematical theorem at work. As for orthogonal matrices, I conjectured that the theorem was probably an asymptotic result: as the size of the matrix gets large, the eigenvalues have certain statistical properties. To test my conjecture, I repeated the simulation for random 100x100 matrices. The following graph shows the distribution of the eigenvalues for 5,000 simulated matrices:

The scale of the plot is determined by the distribution of the eigenvalue near 50=100/2. The other histograms are so scrunched up near zero that you can barely see their colors.

To better understand the expected values of the eigenvalues, I computed the means of each distribution. These sample means estimate the expected values of the eigenvalues for n=100. For my simulated data, I found that the dominant eigenvalue is centered at 50.16. The confidence interval for that estimate does NOT include 50, so I would conjecture that the eigenvalue approaches n/2 from above.

If you plot a histogram of the non-dominant eigenvalues, you get the following graph:

When I saw the shape of that histogram, I was surprised. I expected to see a uniform distribution. However, my intuition was mistaken and instead I saw a shape that curves down sharply at both ends. After I generated eigenvalues for even LARGER matrices, I commented to a colleague that "it looks like the density is semi-circular." However, I had never encountered a semi-circular distribution before.

Conjecture confirmed!

At the end of my previous article, I mentioned a few of these conjectures and asked if anyone knew of a theorem that describes the statistical properties of the eigenvalues of a random symmetric matrix. Remarkably, I had an answer within 24 hours. Professor Steve Strogatz from Cornell University, commented as follows:

Each entry for such a matrix has an expected value of mu= 1/2, and there's a theorem by Furedi and Komlos that implies the largest eigenvalue in this case will be asymptotic to n*mu. That's why you are getting n/2. And the distribution of eigenvalues (except for this largest eigenvalue) will follow the Wigner semicircle law.

The reference Strogatz cites is a 1981 article in Combinatorica titled "The eigenvalues of random symmetric matrices." The first sentence of the paper is "E. P. Wigner published in 1955 his famous semi-circle law for the distribution of eigenvalues of random symmetric matrices." I suppose "famous" is a relative term: I had never heard of the "Wigner semicircle distribution," but it is famous enough to have its own article in Wikipedia.

The paper goes on to formulate and prove theorems concerning the eigenvalues of random symmetric matrices. The theorems explain the phenomena that I noticed in my simulations, including that the dominant eigenvalue is approximately normally distributed and its expected value converges to n/2 from above. See the paper for further details.

SAS Programs for simulation and analysis

If you would like to conduct your own simulations, this section includes the SAS/IML program used to generate the symmetric random matrices. Although I generated the elements from U[0,1], Wigner's result holds for any distribution of elements, as do the results of Furedi and Komlos. The following program writes the eigenvalues to a SAS data set with 5,000 rows and n variables named e1, e2, ..., en. The value of n is determined by the size macro variable.

%let size=10;  /* controls the size of the random matrix */
proc iml;
/* find eigenvalues of symmetric matrices with A[i,j]~U(0,1) */
NumSim = 5000;
call randseed(1);
 
n = &size;   
r = j(n*(n+1)/2, 1);/* allocate array for symmetric elements */
results = j(NumSim, n);
do i = 1 to NumSim;
   call randgen(r, "uniform"); /* fill r with U(0,1) */
   A = sqrvech(r);             /* A is symmetric */
   eval = eigval(A);           /* find eigenvalues */
   results[i,] = eval`;        /* save as ith row */
end;
 
labl = "e1":("e"+strip(char(n))); /* var names "e1", "e2", ... */
create Eigen from results[c=labl];
append from results;
close Eigen;
quit;

The following SAS program is used to analyze the eigenvalues, including making the graphs shown in this article.

/* make a histogram statement for e1-e&n */
%macro eigenhist(n);
%do i = 1 %to &n;
   histogram e&i / transparency = 0.7;
%end;
%mend;
 
/* overlay distributions of e1-e&size */
proc sgplot data=eigen noautolegend;
%eigenhist(&size);
yaxis grid;
xaxis grid label="Eigenvalue";
run;
 
/* dist of expected values */
/* compute mean of each variable (or use PROC SQL or PROC MEANS...) */
proc iml;
use eigen;
read all var ("e1":"e&size") into X;
close eigen;
mean = T(mean(X));
rank = T(do(&size,1,-1)); /* 100, 99, ..., 1 */
create EigMeans var {"rank" "Mean"};
append;
close EigMeans;
quit;
 
proc univariate data=EigMeans;
where rank<&size;
var Mean;
histogram Mean;
run;
Post a Comment

How to read data set variables into SAS/IML vectors

One of the first skills that a beginning SAS/IML programmer learns is how to read data from a SAS data set into SAS/IML vectors. (Alternatively, you can read data into a matrix). The beginner is sometimes confused about the syntax of the READ statement: do you specify the names of the variable in the data set, or the names of the SAS/IML vectors that you are trying to create?

The answer is "yes." J By default, SAS/IML creates vectors that have the same name as the data set variables that you specify on the READ statement. For example, if you want to read variables from the Sashelp.Class data set, use the VAR clause on the READ statement to specify the variable names, like so:

proc iml;
use Sashelp.Class;
read all var {sex height weight};

This code snippet creates three vectors (Sex, Height, and Weight) that each contain 19 rows, which is the number of observations in the data set. However, the question that I've been asked is "What exactly is that thingy between the curly braces?" Is it a list of vector names? Is it something else?

How the SAS/IML language treats character arrays

The answer is "It is a character vector that specifies the names of variables in the data set." The confusion arises because the "thingy between curly braces" doesn't have any quotation marks. However, in the SAS/IML language, characters inside of curly braces are transformed to upper-case strings. In other words, the following two statements are equivalent:

c = {sex height weight};    /* converted to upper case strings */
c = {"SEX" "HEIGHT" "WEIGHT"};

The first statement does not contain quotation marks, but the parser recognizes that it is an array of character values. Therefore, each character string is converted to upper case before it is stored in the vector c.

Specifying variable names

Because SAS variables are not case-sensitive, SAS software doesn't care how you specify the names of data set variables. Upper case, lower case, mixed case,...it's all the same to SAS. Consequently, the original SAS/IML statements are equivalent to the following statements:

read all var {"SEX" "Height" "WeIgT"}; /* names are not case sensitive */

In either case, the READ statement creates three vectors that have the same names as specified on the VAR clause.

Since all specifications read the same variables, you might wonder what label appears when the PRINT statement is used to display a vector. Upper case? Mixed case? The answer is that the PRINT statement uses the same case as the name of the variable in the data set. For this example, the variables in the Sashelp.Class data are mixed case with a leading upper-case letter, as shown in the output of the following PRINT statement:

print sex height weight; /* print in same case as in data set */

For more details on how to read data into SAS/IML variables, see my article "Reading SAS data sets."

Post a Comment

The power method: compute only the largest eigenvalue of a matrix

When I was at SAS Global Forum last week, a SAS user asked my advice regarding a SAS/IML program that he wrote. One step of the program was taking too long to run and he wondered if I could suggest a way to speed it up. The long-running step was a function that finds the largest eigenvalue (and associated eigenvector) for a matrix that has thousands of rows and columns. He was using the EIGEN subroutine, which computes all eigenvalues and eigenvectors—even though he was only interested in the eigenvalue with the largest magnitude.

I asked some questions about his matrix and discovered that it had some important properties:

I told him that the power iteration method is an algorithm that can quickly compute the largest eigenvalue (in absolute value) and associated eigenvector for any matrix, provided that the largest eigenvalue is real and distinct. Distinct eigenvalues are a generic property of the spectrum of a symmetric matrix, so, almost surely, the eigenvalues of his matrix are both real and distinct.

The power iteration method requires that you repeatedly multiply a candidate eigenvector, v, by the matrix and then renormalize the image to have unit norm. If you repeat this process many times, the iterates approach the largest eigendirection for almost every choice of the vector v. You can use that fact to find the eigenvalue and eigenvector.

The power iteration method when the dominant eigenvalue is positive

The power method produces the eigenvalue of the largest magnitude (called the dominant eigenvalue) and its associated eigenvector provided that

  1. The magnitude of the largest eigenvalue is greater than that of any other eigenvalue. That is, the largest eigenvalue is not complex and is not repeated.
  2. The initial guess for the algorithm is not an eigenvector for a non-dominant eigenvalue and is not orthogonal to the dominant eigendirection.

It is easy to implement a SAS/IML module that implements the power iteration method for a matrix whose dominant eigenvalue is positive. You can generate a random vector to serve as an initial value for v, or you can use a fixed vector such as a vector of ones. In either case, you form the image A v, normalize that value, and repeat until convergence. This is implemented in the following function:

proc iml;
/* If the power method converges, the function returns the largest eigenvalue. 
   The associated eigenvector is returned in the first argument, v.
 
   If the power method does not converge, the function returns a missing value.
 
   The arguments are:
    v    Upon input, contains an initial guess for the eigenvector. 
         Upon return it contains an approximation to the eigenvector.
    A    The matrix whose largest eigenvalue is desired. 
    maxIters The maximum number of iterations.
 
   This implementation assume that the largest eigenvalue is positive.
*/
start PowerMethod(v, A, maxIters);
   /* specify relative tolerance used for convergence */
   tolerance = 1e-6; 
   v = v / sqrt( v[##] );  /* normalize */
   iteration = 0; lambdaOld = 0;
 
   do while ( iteration <= maxIters);
      z = A*v;                /* transform */
      v = z / sqrt( z[##] );  /* normalize */
      lambda = v` * z;
      iteration = iteration + 1;
      if abs((lambda - lambdaOld)/lambda) < tolerance then
         return ( lambda );
      lambdaOld = lambda;
   end;
   return ( . ); /* no convergence */
finish;
 
/* test on small example */
A = {-261 209  -49,
    -530 422  -98,
    -800 631 -144 };
v = {1,2,3}; /* guess */
lambda = PowerMethod(v, A, 40 );
if lambda^=. then do; 
   /* check that result is correct */
   z = (A - lambda*I(nrow(A))) * v; /* test if v is eigenvector for lambda */
   normZ = sqrt( z[##] ); /* || z || should be ~ 0 */
   print lambda normZ; 
end;
else print "Power method did not converge";

Efficiency of the power iteration method

Finding the complete set of eigenvalues and eigenvectors for a dense symmetric matrix is computationally expensive. Multiplying a matrix and a vector is, in comparison, a trivial computation. I know that the power method will be much, much faster, than computing the full eigenstructure, but I'd like to know how much faster. Let's say I have a moderate-sized symmetric matrix. About how much faster is the power method over computing all eigenvectors? To be definite, I'll compare the times for symmetric matrices that have up to 2,500 rows and columns.

I previously have blogged about how to compare the performance of algorithms for solving linear systems. I will use the same technique to compare the performance of the PowerMethod function and the EIGEN subroutine. The following loop constructs a random symmetric matrix for a range of matrix sizes. For each matrix, the program times how long it takes the PowerMethod and EIGEN routines to run:

/***********************************/
/* large random symmetric matrices */
/***********************************/
sizes = do(500, 2500, 250); /* 500, 1000, ..., 2500 */
results=j(ncol(sizes), 3);  /* allocate room for results */
call randseed(12345); 
do i = 1 to ncol(sizes);
   n = sizes[i];
   results[i,1] = n;  /* save size of matrix */
   r = j(n*(n+1)/2, 1);
   call randgen(r, "uniform");
   r = sqrvech(r); /* make symmetric */
 
   q = j(n,1,1);
   t0=time();
   lambda = PowerMethod(q, r, 1000 );
   results[i,2] = time()-t0; /* time for power method */
 
   t0=time();
   call eigen(evals, evects, r);
   results[i,3] = time()-t0; /* time for all eigenvals */
end;
labl = {"Size" "PowerT" "EigenT"};
print results[c=labl];

The results are pretty spectacular. The power method algorithm is virtually instantaneous, even for large matrices. In comparison, the EIGEN computation is a polynomial-time algorithm in the size of the matrix. You can graph the timing by writing the times to a data set and using the SGPLOT procedure:

create eigen from results[c=labl];
append from results;
close;
 
proc sgplot data=eigen;
series x=Size y=EigenT / legendlabel="All Eigenvalues";
series x=Size y=PowerT / legendlabel="Largest Eigenvalue";
yaxis grid label="Time to Compute Eigenvalues";
xaxis grid label="Size of Matrix";
run;

In the interest of full disclosure, the power method converges at a rate that is equal to the ratio of the two largest eigenvalues, so it might take a while to converge if you are unlucky. However, for large matrices the power method should still be much, much, faster than using the EIGEN routine to compute all eigenvalues. The conclusion is clear: The power method wins this race, hands down!

What if the dominant eigenvalue is negative?

If the dominant eigenvalue is negative and v is it's eigenvector, v and A*v point in opposite directions. In this case, the PowerMethod function needs a slight modification to return the dominant eigenvalue with the correct sign. There are two ways to do this. The simplest is to compute z = A*(A*v) until the algorithm converges, and then compute the eigenvalue for A in a separate step. An alternative approach is to modify the normalization of v so that it always lies in a particular half-space. This can be accomplished by choosing the direction (v or -v) that has the greatest correlation with the initial guess for v. I leave this modification to the interested reader.

And to my mathematical friends: did you notice that I used random symmetric matrices when timing the algorithm? Experimentation shows that the dominant eigenvalue for these matrices are always positive. Can anyone point me to a proof that this is always true? Furthermore, the dominant eigenvalue is approximately equal to n/2, where n is the size of the matrix. What is the expected value and distribution of the eigenvalues for these matrices?

Post a Comment

Checking your answers: Are computed values close to the true values?

In statistical programming, I often test a program by running it on a problem for which I know the correct answer. I often use a single expression to compute the maximum value of the absolute difference between the vectors:

maxDiff = max( abs( z-correct ) ); /* largest absolute difference */

In this expression, z is the vector that I have computed and correct is the correct answer.

Let's break this expression down into pieces:

  • (z - correct) is the difference between the two vectors. They must have the same number of elements in order for this expression to make sense. The result is the elementwise difference. That is, the expression resolves to a temporary vector (let's call it diff) that is the same length as z and the ith element is equal to z[i] - correct[i].
  • The ABS function returns a vector that contains the absolute values of each element of its argument.
  • The MAX function returns a scalar value that is the maximum value of the elements of its argument.
Consequently, the expression returns a scalar value that is the largest absolute difference between the vector z and the vector correct.

For example, last week I showed how you can use the DIF function to compute simple finite-difference approximations to derivatives. In that article, I computed an approximate derivative to the sine function and compared it to the true derivative, as follows:

proc iml;
h = 0.1;
x = T( do(0, 6.28, h) );     /* x in [0, 2 pi] */
y = sin(x);
approx =  dif(y, 1) / h;     /* f'(x) ~ (f(x)-f(x-h))/h */
correct = cos(x);            /* true derivatives at x */
maxBDiff = max(abs(approx - correct)); /* find maximum difference */
print maxBDiff;

The output tells me that the approximate derivative and the true value differ by about 0.05 for some value of the x vector.

It is interesting to note that you can use the exact same expression if correct is a scalar value. In this case, you are computing the maximum absolute deviation between a vector of values and a target value.

Other ways to compute differences and deviations

In statistics, a difference between two values is called a deviation, especially when one expression is an estimate and another is an expected value. In the language of statistics, the expression in the previous section is similar to the maximum absolute deviation. There are other statistical concepts that you can use to measure the difference between two vectors, or between a vector and a target value:

  • Mean absolute deviation: Instead of taking the maximum, you can compute the mean of the absolute differences: (abs(z-correct))[:]. Recall that the colon (:) subscript operator is a convenient way to compute the mean of a vector in the SAS/IML language.
  • Median absolute deviation: The mean absolute deviation is sensitive to large values. A robust alternative is the median absolute deviation (MAD), which you can compute by using the built-in MAD function: mad(z-correct)
  • Sum of squared errors: You can use the SSQ function to compute the sum of squared errors between two vectors: ssq(z-correct).
  • Mean squared error: Compute the mean of the squared deviations. Define diff = z-correct. Then ssq(diff)/countn(diff) is the mean squared error. I use the COUNTN function rather than the NCOL function in case there are missing values in the diff vector.
  • Root mean squared error: The square root of the previous quantity is the root mean squared error.

There are other measures that you can use (such as relative quantities), but these are some common ways to compute a measure of how much one vector of values differs from another.

Post a Comment

Expand data by using frequencies

A reader asked:

I want to create a vector as follows. Suppose there are two given vectors x=[A B C] and f=[1 2 3]. Here f indicates the frequency vector. I hope to generate a vector c=[A B B C C C]. I am trying to use the REPEAT function in the SAS/IML, language but there is always something wrong. Can you help me?

This is probably a good time to remind everyone about the SAS/IML Community (formerly known as a Discussion Forum). You can post your SAS/IML questions there 24 hours a day. That is always a better plan than making a personal appeal to me, because I receive dozens of questions like this every month, and there is no way that I can personally reply. There are lots of experienced SAS/IML experts out there, so please use the SAS/IML Community to tap into that knowledge.

That said, I think the answer to this reader's question makes an interesting example of statistical programming with SAS/IML software. It is trivial to solve this in the DATA step (see the end of this article), but how might you solve it in the SAS/IML language? If you'd like to try to solve this problem yourself, stop reading here. Spoilers ahead!

Create a new vector that duplicates frequencies

The goal is to write a function that duplicates or "expands" data that have a frequency variable. The important function to use for this task is the CUSUM function, which computes the cumulative frequencies. Let's look at a simple example and apply the CUSUM function to the frequency vector:

proc iml;
values={A,B,C,E};
freq = {2,1,3,4};
cumfreq = cusum(freq);
print values freq cumfreq;

As shown in the output, the cumfreq variable contains the indices for the expanded data. The expanded data will be a vector that contains 10 elements. The first data value (A) repeats twice (the freq value), so it repeats until element 2 (the cumfreq value) in the expanded vector. The second category fills element 3. The next category repeats 3 times, so it occupies up through element 6 in the expanded vector. The last category repeats until element 10. The following DO loop specifies each data value and the indices of the expanded vector that it should occupy:

print (values[1])[label="value"] (1:cumFreq[1])[label="Indices"];
do i = 2 to nrow(values);
   bIdx = 1 + cumFreq[i-1]; /* begin index */
   eIdx = cumFreq[i];       /* end index */
   value = values[i];
   print value (bIdx:eIdx)[label="Indices"];
end;

The output shows that we have all the information we need to allocate a vector of length 10 and fill it with the data values, where the ith value is repeated freq[i] times. The key, it turns out, is to use the CUSUM function to find the indices that correspond to the each data value.

A module to compute the expanded data

In SAS procedures that support a FREQ statement, the frequency values must be positive integers. If the frequency value is missing or is a nonpositive value, the corresponding data value is excluded from the analysis. It is easy to add that same feature to a module that takes a vector of values and a vector of frequencies and returns a vector that contains the data in expanded form. This is implemented in the following SAS/IML module, which allocates the result vector with the first data value in order to avoid handling the first element outside of the DO loop:

start expandFreq(_x, _freq);
   /* Optional: handle nonpositive and fractional frequencies */
   idx = loc(_freq > 0); /* trick: in SAS this also handles missing alues */
   if ncol(idx)=0 then return (.);
   x = _x[idx];
   freq = round( _freq[idx] );
 
   /* all frequencies are now positive integers */
   cumfreq = cusum(freq);
 
   /* Initialize result with x[1] to get correct char/num type */
   N = nrow(x);
   expand = j(cumfreq[N], 1, x[1]); /* useful trick */
 
   do i = 2 to N;
      bIdx = 1 + cumFreq[i-1]; /* begin index */
      eIdx = cumFreq[i];       /* end index */
      expand[bIdx:eIdx] = x[i];/* you could use the REPEAT function here */
   end;
   return ( expand );
finish;
 
/* test the module */
values={A,B,C,D,E,F};
freq = {2,1,3,0,4,.}; /* include nonpositive and missing frequencies */
y = expandFreq(values, freq);
print values freq y;

Notice that you don't actually need to use the REPEAT function because SAS/IML is happy to assign a scalar value into a vector. The scalar is automatically repeated as often as needed in order to fill the vector.

A DATA step solution

As indicated at the beginning of this post, the DATA step solution is quite simple: merely use the OUTPUT statement in a loop, as shown in the following example:

data Orig;
input x $ Freq;
datalines;
A 2
B 1
C 3
D 0
E 4
F .
;
run;
 
/* expand original data by frequency variable */
data Expand;
keep x;
set Orig;
if Freq<1 then delete;
do i = 1 to int(Freq);
   output;
end;
run;
proc print data=Expand; run;

The output data set contains the same data as the y vector in the SAS/IML program.

Post a Comment

The DIF function: Compute lagged differences and finite differences

To a statistician, the DIF function (which was introduced in SAS/IML 9.22) is useful for time series analysis. To a numerical analyst and a statistical programmer, the function has many other uses, including computing finite differences.

The DIF function computes the difference between the original vector and a shifted version of that vector. In terms of the LAG function, DIF(x,k) = x - LAG(x,k) for any value of the lag parameter, k. I blogged about the usefulness of the LAG function earlier this week.

Using the DIF function to compute forward and backward differences

For a function that is given by a formula, you can use the NLPFDD subroutine to compute finite difference derivatives. However, sometimes a function is known only at a finite set of points. In that case, you have a choice: you can either model the function by using regression techniques or you can assume that the function is piecewise linear.

Some curves really are piecewise linear. For example, an ROC curve is piecewise linear, and you can compute the exact derivatives by using a forward or backward difference scheme. You can also compute an exact area under the piecewise linear function by using the trapezoidal rule of integration.

The DIF function makes it easy to compute lagged differences (finite differences) in a sequence of values. As an example, the derivative of a function can be approximated by the backward difference formula: f'(x)(f(x)-f(x-h))/h for small values of h. If you know the values of f at a discrete set of points x1 < x2 < ... < xn, then you can use the DIF function to evaluate the backward difference because the expression f(xi-h) is the lagged term f(xi-1). For example, the following SAS/IML program computes a sequence of evenly spaced x values and evaluates the sine function at these points. The points of the backDiff vector approximate the derivative of the sine at each value of x:

proc iml;
h = 0.1;
x = T( do(0, 6.28, h) );
y = sin(x);
backDiff =  dif(y, 1) / h;     /* f'(x) ~ (f(x)-f(x-h))/h */

When the DIF function is called with a single argument, a lag of 1 is assumed, so you can also write backDiff = dif(y)/h.

We know from calculus that the exact derivative of the sine function is the cosine. The following function computes the exact derivative at each value of x and compares it with the finite difference approximation:

deriv = cos(x);
maxBDiff = max(abs(deriv-backDiff)); /* find maximum difference */
print maxBDiff;

The following plot shows the exact derivative and the backward difference approximation at each point of x:

The finite difference approximations are in close agreement with the exact values. You can also plot the forward difference approximation, which is similar. The forward difference requires using a shift value of -1. When you work through the formula, you find that forwardDiff = -dif(y, -1) / h.

Finite differences for irregularly spaced data

The DIF function also "works" on irregularly spaced data. For data that are not evenly spaced, the h parameter, which is the difference between adjacent x values, is no longer constant. You can use the DIF function to compute the distance between x values, and then compute the slopes as shown in the following statements:

/* irregular spacing and no formula */
x = {0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0};
y = {0.3, 0.6, 0.7, 0.7, 0.9, 1.0, 1.0};
dx = dif(x);  /* difference for adjacent x values (lag=1) */
dy = dif(y);  /* difference for adjacent y values (lag=1) */
slopes = dy/dx;
print dx dy slopes;

You can also use the DIF and LAG function to implement integration schemes. For example, in my article on the trapezoidal rule of integration, I could have implemented the trapezoidal rule by using LAG and DIF instead of using indexes to form the lag of the data vectors manually.

Post a Comment

The LAG function: Useful for more than time series analysis

To a statistician, the LAG function (which was introduced in SAS/IML 9.22) is useful for time series analysis. To a numerical analyst and a statistical programmer, the function provides a convenient way to compute quantitites that involve adjacent values in any vector.

The LAG function is essentially a "shift operator." It shifts a vector of values and pads the result with missing values so that the returned vector has the same number of elements as the original vector. For example, the following SAS/IML statements define the first few terms of the Fibonacci series and call the LAG function to shift the series by one element.

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21}; /* Fibonacci sequence */
lag1 = lag(v);            /* by default, lag=1 ==> shift forward */
first = 1:(nrow(v)-1);    /* index 1:(N-1) */
v1 = v[first];            /* extract all but the last element */
print lag1 v1;

The returned vector, lag1, contains a missing value in the first element and does not contains the last element of v. Notice that the nonmissing values are similar to v1, which is obtained by subsetting the first N-1 elements of the vector v.

You can shift elements the other way by using a negative value for the lag parameter. (This is sometimes called computing a lead.)

lag2 = lag(v, -1);       /* shift backward */
last = 2:nrow(v);        /* index 2:N */
v2 = v[last];            /* extract all but first element */

The returned vector (not shown) contains a missing value in the last element and does not contains the first element of v.

The LAG function is valuable when you want to compute a quantity that involves adjacent elements. For example, the following statements compute the ratio of adjacent values in the Fibonacci sequence:

z = v/lag(v); /* ratio of adjacent values */
print z;

This ratio quickly converges to the Golden Ratio, which is which is 1.61803399.... In a previous post, I show how you can undestand this result by looking at the eigenvalues of a certain linear transformation.

So, yes, by all means, use the LAG function to compute lags and leads in time series data. However, the LAG functon is also useful for any numerical computation that involves adjacent values in a sequence.

Post a Comment

Popular! Articles that strike a chord with SAS users

Next week I will be in Orlando for SAS Global Forum 2012. If you will be there too, I hope you will try to find me (I've posted my schedule!) to discuss statistical programming with SAS software.

If you aren't going to Orlando, why don't you write a comment to this post and tell me what tools or features you need in SAS software in order to work more efficiently and accomplish more?

After SAS Global Forum I am taking some days off for vacation. While I am on a blogging hiatus, I hope you will enjoy these links to some previous articles.

I'll see you in May!

Popular articles from The DO Loop

I blog about a lot of topics, but the following five categories represent some of my favorite subjects. Judging by the number of readers and comments, these articles have struck a chord with SAS users. If you haven't read them, check them out. (If you HAVE read them, some are worth re-reading!)

  1. SIMULATION: If you run simulations in SAS, you had better understand the four essential functions for statistical programmers. Simulation depends on random samples, so it is good to know how to generate random numbers in SAS. Lastly, it is important to understand random number streams in SAS and how they work.
  2. MATRIX COMPUTATIONS: In the article "Solving linear systems: Which technique is fastest?" I show that solving a specific linear system is about four times faster than solving for a general inverse. This article also inspired the popular article, "What is the chance that a random matrix is singular?"
  3. STATISTICAL PROGRAMMING: You can't program something if you don't understand it. In the article "What is Mahalanobis distance?" I describe the geometry of Mahalanobis distance, which provides a way to measure distances that takes into account correlations in the data. This article is linked to from Wikipedia because it is an "intuitive illustrated explanation." I've also written several other articles related to multivariate statistics:
  4. SAS PROGRAMMING: When you run SAS 9.3 in the windowing environment, HTML is the default output destination. The article "How to clear the output window in SAS 9.3" describes how to clear the tables and graphics in the HTML destination. If you haven't yet adopted SAS 9.3, read the top five features of SAS 9.3 that every SAS user will love.
  5. STATISTICAL THINKING: The internet is great for a lot of things, but not for estimating the relative popularity of topics. In the article "Estimating popularity based on Google searches: Why it's a bad idea" I argue that an estimate of popularity that is based on internet searches is a biased estimate. Statisticians at Google don't make this mistake and neither should you.
Post a Comment

Extending SAS: How to define new functions in PROC FCMP and SAS/IML software

SAS software provides many run-time functions that you can call from your SAS/IML or DATA step programs. The SAS/IML language has several hundred built-in statistical functions, and Base SAS software contains hundreds more. However, it is common for statistical programmers to extend the run-time library to include special user-defined functions.

In a previous blog post I discussed two different ways to apply a log transformation when your data might contain missing values and negative values. I'll use the log transformation example to show how to define and call user-defined functions in SAS/IML software and in Base SAS software.

A "safe" log transformation in the SAS/IML language

In the SAS/IML language, it is easy to write user-defined functions (called modules) that extend the functionality of the language. If you need a function that safely takes the natural logarithm and handles missing and negative values, you can easily use the ideas from my previous blog post to create the following SAS/IML function:

proc iml;
/* if Y>0, return the natural log of Y
   otherwise return a missing value  */
start SafeLog(Y);
   logY = j(nrow(Y),ncol(Y),.); /* allocate missing */
   idx = loc(Y > 0);            /* find indices where Y > 0 */
   if ncol(idx) > 0 then logY[idx] = log(Y[idx]);
   return(logY);
finish;
 
Y = {-3,1,2,.,5,10,100}; 
LogY = SafeLog(Y);
print Y LogY;

The program is explained in my previous post, but essentially it allocates a vector of missing values and then computes the logarithm for the positive data values. The START and FINISH statements are used to define the SafeLog function, which you can then call on a vector or matrix of values.

In this example, the function is defined only for the current PROC IML session. However, you can store the function and load it later if you want to reuse it.

Defining a "safe" log transformation by using PROC FCMP

You can also extend the Base SAS library of run-time functions. The FCMP procedure enables you to define your own functions that can be called from the DATA step and from other SAS procedures. (The MCMC procedure has an example of calling a user-defined function from a SAS/STAT procedure.) If you have never used the FCMP procedure before, I recommend Peter Eberhardt's 2009 paper on defining functions in PROC FCMP. For a more comprehensive treatment, see Jason Secosky's 2007 paper.

Technically, you don't need to do anything special in the DATA step if you want a SAS missing value to represent the logarithm of a negative number: the DATA step does this automatically. However, the DATA step also generates some scary-looking notes in the SAS LOG:

NOTE: Invalid argument to function LOG at line 72 column 5.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
74         -3 1 2 . 5 10 100
x=-3 y=. _ERROR_=1 _N_=1
NOTE: Missing values were generated as a result of performing an operation on missing values.
NOTE: Mathematical operations could not be performed at the following places. The results of
      the operations have been set to missing values.

I prefer my programs to run with a clean, healthy-looking SAS LOG, so I will use PROC FCMP to define a SafeLog function that has the same behavior (and name!) as my SAS/IML function:

proc fcmp outlib=work.funcs.MathFuncs;
function SafeLog(y);
   if y>0 then return( log(y) );
   return( . );
endsub;
quit;

The function returns a missing value for nonpositive and missing values. The definition of the function is stored in a data set named WORK.FUNCS, which will vanish when you exit SAS. However, you can create the definition in a permanent location if you want to call the function in a later SAS session.

In order to call the function from the DATA step, use the CMPLIB= option, as shown in the following example:

options cmplib=work.funcs;  /* define location of SafeLog function */
data A;
input x @@;
y = SafeLog(x); /* test the function */
datalines;
-3 1 2 . 5 10 100 
;
run;

The result is not shown, but it is identical to the output from the SAS/IML program.

You might not have need for the SafeLog function, but it is very useful to know how to define user-defined functions in SAS/IML software and in Base SAS software. SAS/IML modules and PROC FCMP functions make it easy to extend the built-in functionality of SAS software.

Post a Comment