How to evaluate SAS expressions in DATA Step dynamically

6

In many programming languages, there is a function named eval() that can be used to evaluate an expression and return the result at run time. For example, in Python, the eval() function parses the expression passed to it and runs a Python expression or code within the program. Even Python eval() supports more optional parameters, such as global/local dictionary for runtime context, but the goal for evaluation remains the same.

When an expression is stored as a character variable of an SAS observation, it suggests the running context is the SAS program execution phase, so the internal data structure like Input Buffer (IB), Program Data Vectors (PDV) and Descriptor Information of the output database are all accessible. Furthermore, all SAS system functions are also accessible in an expression, just like the raw SAS code exposed to SAS compiler and runtime. In the following SAS data set, what the user wants is to get the real value from the expression in variable C. It is different from the commonly used Calculated Column, which uses the same computing rule to expand a new variable for all observations. Here the column C has a different computing rule for each observation; the user's expected result is the column D.

%let MVAR=2;
data a; 
  length a b 8 c $ 255;
  a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
  a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
  a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
  a=0; b=1; c='cdf("NORMAL", 0, a, b)'; output; /* Probability */
run;
proc print;run;

What solutions ahead?

Someone might want a solution that parses the expression in variable C, and then try to rebuild the abstract syntax tree and try to evaluate it from bottom to top. I can’t say this solution is totally wrong but it’s very complex and too hard to generate a general solution. If we have an SAS function eval() in DATA Step, then we can easily use the following code to achieve the goal. Unfortunately, SAS did not provide the eval() function.

 
data b; 
  set a;
  d=eval(c);
run;

SAS provides Macro function %eval, %sysevalf to evaluate arithmetic and logical expressions using integer or floating-point arithmetic. SAS also provides function resolve() (not supported in the DATA Step that runs in CAS) to return the resolved value of the argument after it has been processed by the macro facility. Anyway, resolve() can’t access the values of macro variable assigned by symput() or symputn() at program execution phase. So, for the SAS code as below, it outputs c=150 and d=200. For simple arithmetic and logical expression, we can use the tranwrd() function to replace all variable names with real values, and then use resolve('%sysevalf('|| cats(expression_with_real_values) || ')') to get evaluated result, but it limited to only SAS functions supported by %eval() and %sysevalf() macro functions.

%let a=100;
%let b=50;
data _null_;
  call symput("b", 200);  
  c=resolve("%eval(&a+&b)");
  put c=;
  d=symget("b");
  put d=; 
run;

Now let’s return to the key question we raised: How can we implement the eval() function in SAS Data Step? The best solution we found so far is to use the dynamic code generation mechanism in SAS, and then submit the code to SAS for parsing and execution. In this way, we retrieve string expression from a variable to a real expression in SAS code successfully. We don’t care what the valid SAS expression is, we totally convert it to the code snippets and submit it for execution. Syntax check, reference to PDV variables, internal functions, and even SAS Macro variables are all supported. Yes, it’s a smart and concise implementation for general purposes.

Due to each observation having a different computing rule, we need to use _N_ to control the observation mapping. So, for the existing dataset A, we can use the following code to achieve the goal. The key secret is to use CALL EXECUTE to generate SAS code dynamically and delay execution after the code is ready. In the output dataset B, we have a numeric column D with the value evaluated with the character expression from column C. You can see the output on the RIGHT part of the Figure 1.

data _null_;
  set a end=last;
  if _n_=1 then call execute("data b; set a;");
  call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
  if last then call execute("run;");
run;
proc print; run;

Wrap up as a reusable SAS macro

We also can wrap up the above SAS code in a reusable SAS macro, so this logic can be used anywhere you want. The %evalvar macro has four arguments: ds= and var= are the input dataset and columns with expression to be evaluated. And outds= and outvar= are the output dataset and columns to store result. We also can specify same value for ds= and outds=, so the user just expands the existing dataset with an additional evaluated column.

%macro EvalVar(ds=, var=, outds=, outvar=);
data _null_;
  set &ds end=last;
  if _n_=1 then call execute("data &outds; set &ds;");
  call execute( "if _n_=" || compress(_n_) || " then &outvar=" || &var || ";");
  if last then call execute("run;");
 run;
%mend; 
 
%EvalVar(ds=a, var=c, outds=b, outvar=d);
proc print; run;

By the way, I also had a temporary SAS code generation solution implemented via the %include SAS macro. The temporary SAS code will be deleted automatically when the SAS session is closed. The sample code is also attached here for your information.

filename tmpfile temp;
data _null_;
  file tmpfile;
  set a end=last;  
  if _N_=1 then put "data b; set a;";
  put "if _N_ =" _N_ "then d=" c ";";
  if last then 	put "run;";
run;
%include tmpfile;
proc print; run;

Summary

In this article, we talk about how to evaluate SAS expressions in Data Step dynamically. The expression parsing and execution are totally handled by SAS at program execution phase. It avoids handling abstract syntax tree parsing and evaluation ourselves on it. We introduce two dynamic code generation implementations via call execute or %include. We also use _N_ to control observation mapping due to each observation has different computing rules. This implementation can reflect the beauty of simplicity in SAS before SAS provides system function eval() one day.

Share

About Author

Yinliang Wu

Director of Research and Development at SAS

Yinliang Wu has over 21 years software industry management and business experience. He focuses on data science, parallel computing and visualization such as AI, BI, big data, data visualization, quantitative trading, web crawler etc. Currently he is working on SAS Visual Analytics product family research and development. He received his Master of Science degree from Peking University. He is author of the book SAS 技术内幕 (2018) [Insider of SAS Technology: From Programmer to Data Scientist] and co-author of the book "数据分析实用教程 (2021) [Data Analysis Practical Tutorial] ".

6 Comments

  1. Peter Lancashire on

    Executing literal code without checking it leaves you open to a code injection attack. Use this method only if you trust the source of the data.

    Consider what would happen if an attacker provided "0; run; x 'something-bad';" and you had not prevented use of the X statement.

    In general mixing code and data in this way will give you not only a security but also a maintenance headache.

    • Yinliang Wu

      Thank you, Peter. In general, executing literal code always has code injection risk, this is why software providers empahsize security more and more, and try to seperate boundary of data and code. Beside X command, SAS also provide a lot of functions/routines under 'Special' and "File I/O" category, which may do something bad to system. However, on the other side, business demand is complicated and sometimes users have to find the balance between security and flexibity/extensibility, platform/language provider has to find a balance and try to satisfy users for this logic. In fact, almost all eval() function in different languages have the same risk, even SQL has the famouse SQL Injection, users have the responsibity to ensure data and code are both safe enough, and this logic only be used in quite limited area such as dynamic evaluation. Users must knows the data and code well and adopt this method prudently, adding a token check logic would be good. Here I attached a quick example code:

      proc fcmp outlib=work.func.safe;
        function checksafe(exp $);
          length token $ 40;
      	index=0;
      	do until(token=' ');
      	  index=index+1;
      	  token=scan(exp, index);    
      	  if lowcase(token) in ("x" "system" ) then return(0);					
      	end;
          return(1);
        endsub;
      quit;
      option cmplib=work.func;

      Then add "if checksafe(c)=1 then " before the "call execute(...)", we can simply exclude the x command and system function/call routine etc. If an observation contains predefined danger tokens, execution logic is going to be excluded.

      data _null_;
        set a end=last;  
        if _n_=1 then call execute("data b; set a;");   
        if checksafe(c)=1 then 
          call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
        if last then call execute("run;");
      run;
  2. Thanks for writing about this topic. Using the CALL EXECUTE statement in the DATA step to generate code is extremely useful. Because this post describes several issues, I want to emphasize to your readers that the complexity in using CALL EXECUTE is because the generated SAS statements do not execute until AFTER the DATA step completes. That's why you must include “data B; set A;” and “run;” as part of the program.

    You can use the OPTIONS SOURCE option to see the code that is generated by the CALL EXECUTE statements. This can be useful to understand a program that is not working correctly. For example, the following DATA step is wrong, and the SOURCE option shows why:

    options source;
    data wrong;
    set A;
    call execute(catx(' ','d =', c, ';'));  /* does not work as intended */
    run;

    The SAS log shows that the generated statements are

         d = a**2+b**2 ;
         d = b/a * 2 ;
         d = cos(b/a) ;
         d = cdf("NORMAL", 0, a, b) ;

    This is not a valid SAS program. Even if you put "DATA B; set A;" before these statements and "RUN;" after these statements, you do not get the desired answer because for every observation the final statement (d=CDF(...)) is used to assign the result to d.

    This is why using the "IF _N_ =" trick is so valuable. With that trick, the correct program is implemented as follows:

    options source;
    data Result;
    set A end=eof;
    length cmd $60;
    if _n_=1 then call execute('data B; set A;');
                  cmd = catx(' ','if _n_ eq', _n_, 'then d =', c,';');
                  call execute(cmd);
    if eof   then call execute('run;');
    run;
     
    proc print data=Result; run;
    proc print data=B; run;

    The log reveals that the generated statements are now correct:

    data B; set A;
    if _n_ eq 1 then d = a**2+b**2 ;
    if _n_ eq 2 then d = b/a * 2 ;
    if _n_ eq 3 then d = cos(b/a) ;
    if _n_ eq 4 then d = cdf("NORMAL", 0, a, b) ;
    run;

    I've defined a new variable called CMD in the RESULT data set that shows the statements that will be executed when the DATA step exits. This is a useful trick.

    Again, thanks for discussing this topic. For readers who might wonder when to use this technique, one application is to write a template in the Graph Template Language (GTL) that modifies a graph that is generated by a SAS procedure.

    • Yinliang Wu

      Thanks Rick for clarification. Yes, the code is short, but more clarification for its complexity is needed.
      SAS provide CALL EXECUTE(arugment) is very helpful to extend program logic at runtime. Here argument is a string constant or variable with SAS MACRO or SAS Statement.
      However, string quoted with single quote mark ' and double quote mark " have different logic:

      If argument is quoted with double quote mark ", the argument is parsed before current data step compliation. if it is quoted with single quote mark ', the argument is parsed during program execution. Anyway, MACRO code in argument always being triggered when argument is parsed (before current DATA Step or at CALL EXECUTE statement), while non-MACRO code always be triggered after current DATA step code execution finished.

    • Yinliang Wu

      There are another 2-3 ways to do the same thing, but they are not the best practice. somebody may have concern about the IF logic, and try to find another code generation approach - Yes, we do can generate small code snippet, but it has less performance as it use a small execution context (seperate data step compilation unit), or too heavy execution context (seperate session). However, they are helpful for some specific senarios, but not for general solution. Immediate execution approach can't write back to the same dataset due to write conflict by record level control.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top