The most common way to read observations from a SAS data set into SAS/IML matrices is to read all of the data at once by using the ALL clause in the READ statement.
However, the READ statement also has options that do not require holding all of the observations in memory. This approach can be useful when a data set is huge and cannot fit entirely into the RAM on your computer. It is also useful when computing observation-wise statistics. For example, when you score a regression model, you can score each observation independently of the others.
There are three techniques that enable you to read data one observation at a time from a SAS data set:
- Use a DO DATA loop in conjunction with a READ NEXT statement.
- Use a DO DATA loop with a READ CURRENT statement.
- Use a DO DATA loop with a READ POINT statement.
The first two techniques are discussed in this article. The third technique will be covered in a future article.
Reading All Observations: The READ ALL Statement
SAS/IML programmers often read data for all observations at once. For example, the following statements define a SAS data set and use the READ ALL statement to read all of the data into a SAS/IML matrix, m:
/** each row defines a matrix A and a vector x **/
data inData;
input A11 A12 A21 A22 x1 x2;
datalines;
1 2 3 4 0 0
1 3 2 1 1 2
2 3 1 2 1 1
3 3 2 1 2 1
;

proc iml;
/** read all observations into 4x6 matrix **/
varNames = {A11 A12 A21 A22 x1 x2};
use inData;
read all var varNames into m;
close inData;
To access the ith observation in the inData data set, use the ith row of the matrix: m[i,]. You can loop over the rows to compute some statistic for each row.
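As a minimal sketch of such a row-wise loop, the following statements compute y = x*A for each row of m and store the results in a matrix. The sketch assumes that the inData data set defined above exists; the names y and i are chosen only for illustration.

```sas
proc iml;
/** assumes the inData data set from the previous listing exists **/
varNames = {A11 A12 A21 A22 x1 x2};
use inData;
read all var varNames into m;
close inData;

y = j(nrow(m), 2, .);          /** allocate one 1x2 result per observation **/
do i = 1 to nrow(m);
   A = shape(m[i, 1:4], 2);    /** 2x2 matrix, filled rowwise **/
   x = m[i, 5:6];              /** 1x2 row vector **/
   y[i, ] = x*A;
end;
print y;                       /** rows: {0 0}, {5 5}, {3 5}, {8 7} **/
quit;
```

This version holds every observation in memory at once, which is fine for small data but is exactly what the techniques in the following sections avoid.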
Reading One Observation after Another: The READ NEXT Statement
An alternative approach is to read the data one observation at a time—an approach that is familiar to SAS DATA step programmers! The DO DATA statement enables you to read one observation at a time until the end-of-file (EOF). The basic approach is shown in the following statements:
use inData;
do data;
   read next var varNames into Obs;
   /** compute with this row... **/
end;
close inData;
The READ NEXT statement increments an "observation pointer" so that what was the "next" observation is now the "current" observation. That observation is then read into the Obs vector. The DO DATA loop continues as long as there are unread observations in the data set. After the last observation is read, the loop ends when execution reaches the END statement.
To give a concrete example, each row of the inData data set contains a 2x2 matrix (stored rowwise in the variables A11, A12, A21, and A22) and a 1x2 vector (stored in x1 and x2). The following statements read each row of the data, compute the vector y = x*A, and write the results to the outData data set:
use inData;
y = {. .};                     /** y is 1x2 numerical vector **/
create outData from y[colname={"y1" "y2"}];

setin inData;                  /** make current for reading **/
setout outData;                /** make current for writing **/
do data;
   read next var varNames into Obs;   /** from inData **/
   A = Obs[ ,1:4];
   x = Obs[ ,5:6];
   A = shape(A, 2);            /** convert to 2x2 matrix **/
   y = x*A;
   append from y;              /** to outData **/
end;
close inData outData;
Notice how the program uses the SETIN and SETOUT statements so that the READ statement reads data from inData and the APPEND statement writes data to outData. Each iteration of the DO DATA loop appends one observation, so when the loop finishes, the ith observation of the outData data set contains the result of the computation on the ith observation of the inData data set.
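One way to check the result is to read outData back into a matrix and print it. This is a minimal sketch that assumes the preceding program has already run, so that outData exists with the variables y1 and y2; the matrix name result is chosen only for illustration.

```sas
proc iml;
/** assumes outData was created by the preceding program **/
use outData;
read all var {y1 y2} into result;
close outData;
print result;   /** rows: {0 0}, {5 5}, {3 5}, {8 7} for the inData values above **/
quit;
```

For example, the second observation of inData defines A = {1 3, 2 1} and x = {1 2}, so x*A = {5 5}, which is the second row of result.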
Staying on the Same Observation: The READ CURRENT Statement
The NEXT clause in the READ statement makes the next observation current, and then reads from that observation. In contrast, the CURRENT clause reads from the current observation, but does not change which observation is current. Consequently, the READ CURRENT statement is rarely used by itself, but is instead used in conjunction with other READ statements.
The READ NEXT and READ CURRENT statements can work together to read multiple variables from the same observation into different SAS/IML variables. In the previous example, all variables were read into the Obs vector, and the A and x matrices were created from Obs. In the following statements, the A and x matrices are created directly by reading relevant variables from the same observation:
use inData;
do data;
   read next var {A11 A12 A21 A22} into A;
   read current var {x1 x2} into x;
   /** compute with this row... **/
end;
close inData;
How does this work? In the first iteration of the DO DATA loop, the "observation counter" is 0. The READ NEXT statement sets the counter to 1 and reads four variables. The READ CURRENT statement reads another two variables from that same observation. The second iteration sets the "observation counter" to 2 and reads values for that observation. This process continues until all observations are read.
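To make the pattern concrete, here is a sketch that fills in the "compute with this row" step with the same y = x*A computation used earlier. It assumes the inData data set defined at the top of this article; printing y on each iteration is only for illustration.

```sas
proc iml;
/** assumes the inData data set defined earlier exists **/
use inData;
do data;
   read next var {A11 A12 A21 A22} into A;   /** advance to next obs; A is 1x4 **/
   read current var {x1 x2} into x;          /** same obs; x is 1x2 **/
   y = x * shape(A, 2);                      /** reshape A to 2x2, then multiply **/
   print y;
end;
close inData;
quit;
```

Because each READ statement names only the variables it needs, no subscripting of a combined vector is required.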
Using the READ NEXT statement is called sequential access of the data: each observation is read after the previous one. There is a second way to access data, and that is called random access. Next week I will show how to use the READ POINT statement to randomly access data.
In conclusion, although it is common (and usually sufficient) to read all observations from a data set into SAS/IML matrices and vectors, you can also use the DO DATA statement to read and process one observation at a time.