How to evaluate SAS expressions in DATA Step dynamically

10

In many programming languages, there is a function named eval() that can be used to evaluate an expression and return the result at run time. For example, in Python, the eval() function parses the expression passed to it and runs a Python expression or code within the program. Even Python eval() supports more optional parameters, such as global/local dictionary for runtime context, but the goal for evaluation remains the same.

When an expression is stored as a character variable of an SAS observation, it suggests the running context is the SAS program execution phase, so the internal data structure like Input Buffer (IB), Program Data Vectors (PDV) and Descriptor Information of the output database are all accessible. Furthermore, all SAS system functions are also accessible in an expression, just like the raw SAS code exposed to SAS compiler and runtime. In the following SAS data set, what the user wants is to get the real value from the expression in variable C. It is different from the commonly used Calculated Column, which uses the same computing rule to expand a new variable for all observations. Here the column C has a different computing rule for each observation; the user's expected result is the column D.

%let MVAR=2;
data a; 
  length a b 8 c $ 255;
  a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
  a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
  a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
  a=0; b=1; c='cdf("NORMAL", 0, a, b)'; output; /* Probability */
run;
proc print;run;

What solutions ahead?

Someone might want a solution that parses the expression in variable C, and then try to rebuild the abstract syntax tree and try to evaluate it from bottom to top. I can’t say this solution is totally wrong but it’s very complex and too hard to generate a general solution. If we have an SAS function eval() in DATA Step, then we can easily use the following code to achieve the goal. Unfortunately, SAS did not provide the eval() function.

 
data b; 
  set a;
  d=eval(c);
run;

SAS provides Macro function %eval, %sysevalf to evaluate arithmetic and logical expressions using integer or floating-point arithmetic. SAS also provides function resolve() (not supported in the DATA Step that runs in CAS) to return the resolved value of the argument after it has been processed by the macro facility. Anyway, resolve() can’t access the values of macro variable assigned by symput() or symputn() at program execution phase. So, for the SAS code as below, it outputs c=150 and d=200. For simple arithmetic and logical expression, we can use the tranwrd() function to replace all variable names with real values, and then use resolve('%sysevalf('|| cats(expression_with_real_values) || ')') to get evaluated result, but it limited to only SAS functions supported by %eval() and %sysevalf() macro functions.

%let a=100;
%let b=50;
data _null_;
  call symput("b", 200);  
  c=resolve("%eval(&a+&b)");
  put c=;
  d=symget("b");
  put d=; 
run;

Now let’s return to the key question we raised: How can we implement the eval() function in SAS Data Step? The best solution we found so far is to use the dynamic code generation mechanism in SAS, and then submit the code to SAS for parsing and execution. In this way, we retrieve string expression from a variable to a real expression in SAS code successfully. We don’t care what the valid SAS expression is, we totally convert it to the code snippets and submit it for execution. Syntax check, reference to PDV variables, internal functions, and even SAS Macro variables are all supported. Yes, it’s a smart and concise implementation for general purposes.

Due to each observation having a different computing rule, we need to use _N_ to control the observation mapping. So, for the existing dataset A, we can use the following code to achieve the goal. The key secret is to use CALL EXECUTE to generate SAS code dynamically and delay execution after the code is ready. In the output dataset B, we have a numeric column D with the value evaluated with the character expression from column C. You can see the output on the RIGHT part of the Figure 1.

data _null_;
  set a end=last;
  if _n_=1 then call execute("data b; set a;");
  call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
  if last then call execute("run;");
run;
proc print; run;

Wrap up as a reusable SAS macro

We also can wrap up the above SAS code in a reusable SAS macro, so this logic can be used anywhere you want. The %evalvar macro has four arguments: ds= and var= are the input dataset and columns with expression to be evaluated. And outds= and outvar= are the output dataset and columns to store result. We also can specify same value for ds= and outds=, so the user just expands the existing dataset with an additional evaluated column.

%macro EvalVar(ds=, var=, outds=, outvar=);
data _null_;
  set &ds end=last;
  if _n_=1 then call execute("data &outds; set &ds;");
  call execute( "if _n_=" || compress(_n_) || " then &outvar=" || &var || ";");
  if last then call execute("run;");
 run;
%mend; 
 
%EvalVar(ds=a, var=c, outds=b, outvar=d);
proc print; run;

By the way, I also had a temporary SAS code generation solution implemented via the %include SAS macro. The temporary SAS code will be deleted automatically when the SAS session is closed. The sample code is also attached here for your information.

filename tmpfile temp;
data _null_;
  file tmpfile;
  set a end=last;  
  if _N_=1 then put "data b; set a;";
  put "if _N_ =" _N_ "then d=" c ";";
  if last then 	put "run;";
run;
%include tmpfile;
proc print; run;

If we need to evaluate SAS expressions against a big table dynamically, we need to take consideration of computational complexity. We should use random (direct) access to dataset instead of using _N_ condition to improve performance and turn off some SAS system options when we submit large number of SAS codes at runtime, e.g. option nomprint nomlogic nosource;

%macro EvalVarEx(ds=, var=, outds=, outvar=);
data _null_;
  set &ds end=last;
  if _n_=1 then call execute("data &outds;");   
  call execute( "_i_=" || compress(_n_) || "; set &ds point=_i_;");
  %let i=1;
  %let v=%scan(&var, &i);
  %let y=%scan(&outvar, &i);
  %do %while( &v ^=); 
    if not missing(&v) then call execute("&y="|| &v || ";"); 
    %let i=%eval(&i+1);
    %let v=%scan(&var, &i);
    %let y=%scan(&outvar, &i);
  %end;
  call execute("output;  ");
  if last then call execute("stop; run;");
 run;
%mend; 
 
data a; 
  length a b 8 c d $ 255;
  do i=1 to 20000;
    a=3; b=4; c='a**2+b**2'; d='a+b'; output;    
  end; 
  drop i; 
run;
option nomprint nomlogic nosource;
%EvalVarEx(ds=a, var=%str(c,d), outds=b, outvar=%str(e,f));
 
proc print data=a (obs=100); run;
proc print data=b (obs=100); run;

Summary

In this article, we talk about how to evaluate SAS expressions in Data Step dynamically. The expression parsing and execution are totally handled by SAS at program execution phase. It avoids handling abstract syntax tree parsing and evaluation ourselves on it. We introduce two dynamic code generation implementations via call execute or %include. We also use _N_ to control observation mapping due to each observation has different computing rules. This implementation can reflect the beauty of simplicity in SAS before SAS provides system function eval() one day.

Share

About Author

Yinliang Wu

Director of Research and Development at SAS

Yinliang Wu has over 23 years software industry management and business experience. He focuses on data science, parallel computing and visualization such as AI, BI, big data, data visualization, quantitative trading, web crawler etc. Currently he is working on SAS Visual Analytics product family research and development. He received his Master of Science degree from Peking University. He is author of the book SAS 技术内幕 (2018) [Insider of SAS Technology: From Programmer to Data Scientist] and co-author of the book "数据分析实用教程 (2021) [Data Analysis Practical Tutorial] ".

10 Comments

  1. Hi, I have tried with 'if _n_ eq', _n_ trick. Input data is with 200K records. entire data step with call execute took about 3 hours to execute. I tried differently thinking why to read all other observations from dataset within call execute since we want to execute condition only when observation number matches between both dataset. since I am reading the same dataset within call execute as well I tried with POINT= option. Execution completed in 5 mins compared to ~3 hours earlier.

    below is the program section. cmd_all, cmd_lo, cmd_hi, cmd_tg are variables with sas statments to be executed.
    I removed 'if _n_ eq', _n_ condition since we reading data with POINT= option.

    data _null_;
    set DT03 end=last;
    by usubjid parcat1 parcat2 paramcd avisitn;
    OBS=_n_;
    if _n_=1 then
    do;
    call execute('data DT04;');
    end;
    call execute('do z='||cats(OBS)||'; set DT03 point=z;');
    if not missing(cmd_all) then call execute(cmd_all);
    if not missing(cmd_tg) then call execute(cmd_tg);
    if not missing(cmd_lo) then call execute(cmd_lo);
    if not missing(cmd_hi) then call execute(cmd_hi);
    call execute('output;end;');
    if last then call execute('stop;Run;');
    run;

    • Yinliang Wu

      Thanks Bhavin for raising this performance concern, I had never thought evaluating expression dynamically for big data :). When we process dataset with large rows, we do need to take consideration of computational complexity.
      It is quite good that you already use random access to read the dataset, it would sigificantly improve the performance. I tried 3 types imprvoment, and it turns out that even we can improve the _n_ condition with "if-else if" and "select-when", the output is still not as good as "random acecss" solution. The computational complexity is listed as below:
      Type Complexity Time (20K rows)
      if: N^2 43s
      if/else if: N^2/2 32s
      select-when:N^2/2 29s
      random access: N 17s

      Please note that the performance is also related to the code lines dynamically generated to I/O from large data, so turn off some sas options will be helpful.
      option nomprint nomlogic nosource nonotes;

  2. For SAS Viya users, here are two other potential solutions:

    1. CASL execute() function

    The is a CASL Built-in function that compiles and executes CASL code. For this example, I create data set "a" in CASUSER (the CAS version of the WORK library), and I replace the &MVAR macro variable reference with MVAR, where MVAR is a CASL variable that is set to the current value of the macro variable at run time. Using PROC CAS, CAS table "a" is pulled into a result table, which is iterated over using execute() to evaluate the expression.

    data casuser.a; 
      length a b 8 c $ 255;
      a=3; b=4; c='a**2+b**2'; output; 
      a=7; b=22; c='b/a * MVAR'; output;         /* Replaced &MVAR with MVAR */
      a=113; b=355; c='cos(b/a)'; output; 
      a=0; b=1; c='cdf("NORMAL", 0, a, b)'; output; 
    run;
     
     
    %let MVAR=2;
     
    proc cas;
      table.fetch result=r / table="a";
      table_a = r.Fetch;
      table_b = newtable("table_b", {'a','b','c','d'}, {'double','double','string','double'});
     
      MVAR = symget('MVAR');
     
      do row over table_a;
         a=row.a; 
         b=row.b;
         execute('d='||row.c||';');
         row = {row.a, row.b, row.c, d};
         addrow(table_b, row);
      end;
     
      saveresult table_b dataout=work.b;
    quit;
     
    proc print data=b;run;

    2. PROC PYTHON

    If enabled, Viya users can insert Python code directly into SAS programs. For the "a" data set I again replace &MVAR with MVAR, a Python variable set to the current value of the macro variable at run time. Also, the trigonometric and probabilistic SAS functions are replaced with Python equivalents. Using PROC PYTHON, data set "a" is pulled into a DataFrame, which it iterated over using eval() to evaluate the expression.

    data a; 
      length a b 8 c $ 255;
      a=3; b=4; c='a**2+b**2'; output;  
      a=7; b=22; c='b/a * MVAR'; output;                 /* Replaced &MVAR with MVAR */ 
      a=113; b=355; c='math.cos(b/a)'; output;           /* Replaced SAS function with Python equivalent */
      a=0; b=1; c='norm.cdf(0,loc=a,scale=b)'; output;   /* Replaced SAS function with Python equivalent */
    run;
     
     
    %let MVAR=2;
     
    proc python;
      submit;
     
      import pandas
      import math
      from scipy.stats import norm
     
      MVAR = float(SAS.symget('MVAR'))           # Get the value of MVAR macro variable
      df = SAS.sd2df('work.a')                   # Pull data set "a" into a dataframe
      df['MVAR'] = MVAR
     
      for (i, a, b, c, MVAR) in zip(df.index, df.a, df.b, df.c, df.MVAR):
        df.at[i,'d'] = eval(c)
     
      ds = SAS.df2sd(df, 'work.b (drop=MVAR)')   # Write dataframe back to data set "b"
     
      endsubmit;
    run;
     
    proc print data=b;run;
    • Yinliang Wu

      Thanks Deryl for sharing the PROC CAS & PROC PYTHON solutions for SAS Viya users. it provides SAS programers more choices.

  3. Peter Lancashire on

    Executing literal code without checking it leaves you open to a code injection attack. Use this method only if you trust the source of the data.

    Consider what would happen if an attacker provided "0; run; x 'something-bad';" and you had not prevented use of the X statement.

    In general mixing code and data in this way will give you not only a security but also a maintenance headache.

    • Yinliang Wu

      Thank you, Peter. In general, executing literal code always has code injection risk, this is why software providers empahsize security more and more, and try to seperate boundary of data and code. Beside X command, SAS also provide a lot of functions/routines under 'Special' and "File I/O" category, which may do something bad to system. However, on the other side, business demand is complicated and sometimes users have to find the balance between security and flexibity/extensibility, platform/language provider has to find a balance and try to satisfy users for this logic. In fact, almost all eval() function in different languages have the same risk, even SQL has the famouse SQL Injection, users have the responsibity to ensure data and code are both safe enough, and this logic only be used in quite limited area such as dynamic evaluation. Users must knows the data and code well and adopt this method prudently, adding a token check logic would be good. Here I attached a quick example code:

      proc fcmp outlib=work.func.safe;
        function checksafe(exp $);
          length token $ 40;
      	index=0;
      	do until(token=' ');
      	  index=index+1;
      	  token=scan(exp, index);    
      	  if lowcase(token) in ("x" "system" ) then return(0);					
      	end;
          return(1);
        endsub;
      quit;
      option cmplib=work.func;

      Then add "if checksafe(c)=1 then " before the "call execute(...)", we can simply exclude the x command and system function/call routine etc. If an observation contains predefined danger tokens, execution logic is going to be excluded.

      data _null_;
        set a end=last;  
        if _n_=1 then call execute("data b; set a;");   
        if checksafe(c)=1 then 
          call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
        if last then call execute("run;");
      run;
  4. Thanks for writing about this topic. Using the CALL EXECUTE statement in the DATA step to generate code is extremely useful. Because this post describes several issues, I want to emphasize to your readers that the complexity in using CALL EXECUTE is because the generated SAS statements do not execute until AFTER the DATA step completes. That's why you must include “data B; set A;” and “run;” as part of the program.

    You can use the OPTIONS SOURCE option to see the code that is generated by the CALL EXECUTE statements. This can be useful to understand a program that is not working correctly. For example, the following DATA step is wrong, and the SOURCE option shows why:

    options source;
    data wrong;
    set A;
    call execute(catx(' ','d =', c, ';'));  /* does not work as intended */
    run;

    The SAS log shows that the generated statements are

         d = a**2+b**2 ;
         d = b/a * 2 ;
         d = cos(b/a) ;
         d = cdf("NORMAL", 0, a, b) ;

    This is not a valid SAS program. Even if you put "DATA B; set A;" before these statements and "RUN;" after these statements, you do not get the desired answer because for every observation the final statement (d=CDF(...)) is used to assign the result to d.

    This is why using the "IF _N_ =" trick is so valuable. With that trick, the correct program is implemented as follows:

    options source;
    data Result;
    set A end=eof;
    length cmd $60;
    if _n_=1 then call execute('data B; set A;');
                  cmd = catx(' ','if _n_ eq', _n_, 'then d =', c,';');
                  call execute(cmd);
    if eof   then call execute('run;');
    run;
     
    proc print data=Result; run;
    proc print data=B; run;

    The log reveals that the generated statements are now correct:

    data B; set A;
    if _n_ eq 1 then d = a**2+b**2 ;
    if _n_ eq 2 then d = b/a * 2 ;
    if _n_ eq 3 then d = cos(b/a) ;
    if _n_ eq 4 then d = cdf("NORMAL", 0, a, b) ;
    run;

    I've defined a new variable called CMD in the RESULT data set that shows the statements that will be executed when the DATA step exits. This is a useful trick.

    Again, thanks for discussing this topic. For readers who might wonder when to use this technique, one application is to write a template in the Graph Template Language (GTL) that modifies a graph that is generated by a SAS procedure.

    • Yinliang Wu

      Thanks Rick for clarification. Yes, the code is short, but more clarification for its complexity is needed.
      SAS provide CALL EXECUTE(arugment) is very helpful to extend program logic at runtime. Here argument is a string constant or variable with SAS MACRO or SAS Statement.
      However, string quoted with single quote mark ' and double quote mark " have different logic:

      If argument is quoted with double quote mark ", the argument is parsed before current data step compliation. if it is quoted with single quote mark ', the argument is parsed during program execution. Anyway, MACRO code in argument always being triggered when argument is parsed (before current DATA Step or at CALL EXECUTE statement), while non-MACRO code always be triggered after current DATA step code execution finished.

    • Yinliang Wu

      There are another 2-3 ways to do the same thing, but they are not the best practice. somebody may have concern about the IF logic, and try to find another code generation approach - Yes, we do can generate small code snippet, but it has less performance as it use a small execution context (seperate data step compilation unit), or too heavy execution context (seperate session). However, they are helpful for some specific senarios, but not for general solution. Immediate execution approach can't write back to the same dataset due to write conflict by record level control.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top