Read from one data set and write to another with SAS/IML


Many people know that the SAS/IML language enables you to read data from and write results to multiple SAS data sets. When you open a new data set, it is a good programming practice to close the previous data set. But did you know that you can have two data sets open simultaneously, one for reading and one for writing? The trick is to use the SETIN and SETOUT statements to specify the data sets that are used for reading and writing. Among other advantages, this technique enables you to use SAS/IML to analyze data that might be larger than can fit into the RAM on your computer.

This article is motivated by a question posted to the SAS/IML Support Community. The SAS/IML programmer was essentially running a BY-group analysis in SAS/IML. The user wanted to use the WHERE clause on the READ statement to read the data for each BY-group.

To illustrate the technique, consider the following BY-group regression analysis of the vehicles in the Sashelp.Cars data set. For each value of the Make variable (Acura, Audi, BMW,...), PROC REG performs an ordinary least-squares (OLS) analysis that uses the Wheelbase and Length variables to model the Weight of a vehicle. The parameter estimates are saved to the OutEst data set:

proc reg noprint outest=OutEst;
   by Make;
   model Weight = Wheelbase Length;

Let's create a similar analysis in the SAS/IML language. Although SAS/IML is often used to perform analyses that are not available from any SAS procedure, this simple and familiar analysis will enable us to concentrate on how to read and write data sets simultaneously. The following SAS/IML statements define a module that computes the parameter estimates for an OLS model. Because some of the BY groups result in singular linear systems, the GINV function is used to solve for the parameter estimates, rather than the more efficient SOLVE function.

proc iml;
start OLSParamEst(Y, _X);
   X = j(nrow(_X),1,1) || _X;   /* add intercept column to design matrix */
   XPX = X` * X;                /* cross-products      */
   /* handle singular systems by using GINV rather than SOLVE */
   Estimate = ginv(XPX) * X`*Y; /* parameter estimates */
   return( T(Estimate) );       /* return as row vector */

The BY-group analysis will be carried out one category at a time. The following statements open the input data set and read the BY-group categories. The output data set (which will contain the parameter estimates) is also opened for writing. Notice that you need to specify the names and types of the output variables on the CREATE statement, including the length of any character variables. Lastly, the SETIN and SETOUT statements are used to specify that the data sets that will be used for subsequent reading and writing operations:

use;                   /* input data */
read all var {"Make"};              /* read BY var */
byGroup = unique( Make );           /* compute unique levels */
ByVal = BlankStr(nleng(Make));      /* set length of ByVal variable (12.3) */
OutVarNames = {"Intercept" "Wheelbase" "Length"}; 
Estimate = {. . .};                 /* output variables are numeric */
create RegOut from Estimate[r=ByVal c=OutVarNames];
setin Sashelp.Cars;                 /* make current for reading */
setout RegOut;                      /* make current for writing */

The SETIN statement means "every time you want to read, use the Sashelp.Cars data set." The SETOUT statement means "every time you want to write, use the RegOut data set." The following statements loop over the unique categories of the Make variable, and perform an OLS analysis for the vehicles that have a common make. You can use the INDEX statement to speed up reading the BY groups.

InVarNames = {"Weight" "Wheelbase" "Length"};    /* input variables */
index Make;                           /* index by BY group variable */
do i = 1 to ncol(byGroup);
   ByVal = byGroup[i];
   read all var InVarNames where(Make=(ByVal));  /* read from Cars */ 
   Estimate = OLSParamEst(Weight, Wheelbase||Length);
   append from Estimate[r=ByVal];                /* write to RegOut */
close Sashelp.Cars RegOut;

The program performs 38 parameter estimates, one for each unique value of the Make variable. Because these data will fit entirely into RAM, you could also perform this analysis by using the UNIQUE-LOC or UNIQUEBY techniques.

In summary, use the SETIN and SETOUT statements when you want to have two SAS data sets open simultaneously, one for reading and one for writing. By the way, you can also use the SETOUT statement to write to several data sets, such as writing predicted and residual values to one while writing parameter estimates to another.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

Leave A Reply

Back to Top