How to evaluate SAS expressions in PROC DS2 dynamically

2

Some readers read the article “how-to-evaluate-sas-expression-in-data-step-dynamically” and wonder if there is a same mechanism or functionality in DS2. As indicated in that article, SAS provides similar features in DATA step, PROC CAS and PROC Python, but some projects like ESP (Event Stream Processing) projects would store those expression definition in Database, and dynamically evaluate the expression at runtime. Even more, those Realtime system need good performance, which means no overhead and no latency should be introduced to existing system.

After reviewing DS2 Language reference, it turns out that we can leverage the similar expression evaluation feature with the DS2 function SQLEXEC. So here we go the initial prototype SAS code for it.

/*Prepare sample test data*/
%let MVAR=2;
data a; /* Create sample data with SAS MACRO var in expression */
  length a b 8 c $ 255;
  do id=1 to 809;
    a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
    a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
    a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
    a=0; b=1; c='cdf(''NORMAL'', 0, a, b)'; output; /* Probability - FedSQL only support single quote */  
  end;
run;

Row by row evaluation

Since DS2 doesn’t support double quote, we use two single quotes escape in the upper argument of CDF function. With upper dataset a, the code to evaluate column c expressions row by row is as below:

data a;
  set a;    
  id=_N_;  /*create temp column to store row id*/
  length c_r $ 255; 
  c_r=resolve(c); /*create temp column to store expression after resolve MACRO variables*/
  d=.;    /*create temp column to store evalution result*/
run;
data tempds;
  set a (keep=c_r); /*due to we can't update a table while openning, we create temp table*/
run;
proc ds2;
  data _null_;
    method run();	  
	  dcl int rc;
      set tempds;
	  rc=sqlexec('update a set d = (' || c_r || ') where id='|| _N_ || ';'); 
	end;
  enddata;
run;
quit;
proc datasets nolist; /*clear the temp table*/
  delete tempds;
quit;
data a; /*clear the temp columns id and c_r*/
  set a(drop=id c_r);
run;

 

Batch update based on unique expressions

As the upper code evaluates expression defined in column c row by row, the performance is not good enough. We can leverage the batch processing to gain high performance with the SQLEXEC statements. The secret is to suppress evaluation to unique expression count. At the same time, we can avoid redundant temp variable for row id.  So we have the batch processing version prototype code as below:

data a;
  set a;   
  d=.;/*create temp column to store evalution result*/
run;
 
proc sort data=a(keep=c) nodupkey out=tempds; 
  by c; /*Build unique expression set*/
run;
data tempds; 
  length c_r $ 255;
  set tempds(keep=c);
  c_r=resolve(c);/*create temp column to store expression after resolve MACRO variables*/
 run;  
proc ds2;
  data _null_;     
    method run();	  
	  dcl int rc;
      set tempds;	  
	  rc=sqlexec('update a set d = (' || trim(c_r) || ') where c =''' || tranwrd(trim(c),'''','''''') ||  '''');  	  
	end;
  enddata;
run;
quit; 
proc datasets nolist;/*clear the temp table*/
  delete tempds;
run;
quit;

Please pay attention to the where clause definition in SQLEXEC, we pass the new expression after Macro variable resolved in column c_r to SQLEXEC function, and we use original expression in column c to filter rows. We also have to escape the single quote via TRANWRD function to make upper code works. It’s very tricky here due to DS2 only support single quote string.

Multi-threaded batch update with DS2 thread package

When the data store in database, we need to avoid data transfer as much as possible. So suppose the table already contains column d for expression evaluation result, and the target Oracle database already has been defined in SAS library oracle. The follow code use DS2 thread package implements the expression evaluation with multiple threads. Suppose each unique expression use one thread for batch process. If threads number specified by user is less than distinct expression number, the same thread will process multiple expressions.

LIBNAME ora ORACLE PATH=<TNSNAME> USER=<USERID> PASSWORD="<PASSWORD>" ;
 
proc datasets lib=ora nolist;
  delete a;
quit;
data ora.a; 
  length a b 8 c $ 256;
  do id=1 to 809;
    a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
    a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
    a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
    a=0; b=1; c='cdf(''NORMAL'', 0, a, b)'; output; /* Probability */
  end;   
run; 
proc fedsql; /*Use FedSQL to create a column to store result*/
	alter table ora.a add column d float;
quit;
 
%let MVAR=2;
 
proc sort data=ora.a(keep=c) nodupkey out=tempds; 
	by c;/*Build unique expression set*/
run;
data tempds;
  length c_r $ 256;
  set tempds(keep=c);
  c_r=resolve(c); /*create temp column to store expression after resolve MACRO variables*/
 run;   
 
proc ds2;
  thread thread_a /overwrite=yes;/*Evaluate based on unique expression with multiple threads*/    	    
	method run();	        
	  dcl int rc;
	  set tempds; 	  
	  by c;	  
	  rc=sqlexec('update ora.a set d = (' || trim(c_r) || ') where c =''' || tranwrd(trim(c),'''','''''') ||  '''');	  
	end;	
  endthread;
  data _null_ / overwrite=yes;
    dcl thread thread_a t();
	method run();
      set from t threads=2; 
	end;
  enddata;
run;
quit;
 
proc datasets nolist;/*clear the temp table*/
  delete tempds;
quit;
proc print data=ora.a  ;
run;

With upper three implementations, the performance is quite different for DS2 running against the same 3236 observations:

Implementation Type Real Time (seconds) CPU Time (seconds)
Row-by-Row Evaluation 39.00 18.00
Batch Update Based on Unique Expressions 0.08 0.04
Multi-Threaded Batch Update with DS2 Thread Package 0.19 0.12

 

For 1000000 observations, the third approach performing multiply operation on two columns needs about 24 seconds with single thread, and 12 seconds with two threads. It means only about 10~20 microseconds are needed for one evaluation. Real time system usually uses small data at a time, so fast expression evaluation is possible for that scenario.

We can also evaluate SAS expression in PROC FedSQL, and research indicate PROC FedSQL has better performance (about 20% time less) than SQLEXEC of PROC DS2 with single thread. For PROC DS2, only one statement can be run in SQLEXEC function (but we can run SQLEXEC multiple times), but multiple FedSQL statements can be submitted in one PROC FedSQL session.

Summary

In this article, we talked about how to evaluate SAS expressions in PROC DS2 with SQLEXEC function dynamically, and we also went thru three different implementations to improve performance as much as possible. It’s possible to achieve fast expression evaluation in DS2 for cross-sectional data. We also can use this approach to implement custom calculated columns for a table, the real value can be pre-calculated or evaluated at runtime. It's very flexible to integrate this dynamic mechanism into customer's real scenario.

Share

About Author

Yinliang Wu

Director of Research and Development at SAS

Yinliang Wu has over 23 years software industry management and business experience. He focuses on data science, parallel computing and visualization such as AI, BI, big data, data visualization, quantitative trading, web crawler etc. Currently he is working on SAS Visual Analytics product family research and development. He received his Master of Science degree from Peking University. He is author of the book SAS 技术内幕 (2018) [Insider of SAS Technology: From Programmer to Data Scientist] and co-author of the book "数据分析实用教程 (2021) [Data Analysis Practical Tutorial] ".

2 Comments

  1. Very impressive thought, I never think of "unique expressions" before, it should be very helpful in data-driven work.
    I have a question: Could you explain how does the TRANWRD function make the UPDATE statement work?

    • Yinliang Wu

      As we can use batch update, unique expression would be very helpful and efficient.
      TRANWRD function executes before sqlexec function submit the fedsql UPDATE statement, so it give programmer the chance.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top