Reasons to love PROC DS2

9

ProblemSolversPROC DS2 (DS2) is a new SAS programming language that extends the power of the traditional DATA step. Some customers have wondered if this new functionality is simply the “DATA step on steroids,” meaning that we simply added more robust functionality so that it’s a replacement for the DATA step. While it includes extra functionality that allows you to do more advanced data processing and manipulation, the addition of methods, packages, additional data types, and the power to run code inside databases boosts the power of this new procedure. For a more in-depth discussion, see the “Similarities between DS2 and the DATA Step” web page and the “Differences between DS2 and the DATA Step” web page. DS2 is not a replacement for the DATA step as INFILE, INPUT, FILE, PUT, and DATALINES statements aren’t permitted. Although most DS2 customers are using DS2 against database tables, it can also be used against traditional SAS data sets.

If DS2 is new to you, you are probably asking why you should invest time to learn about it rather than staying with the ever-reliable DATA step. Let me discuss several reasons why others are finding that DS2 is a welcomed addition to Base SAS®:

  • Additional precision gained from using new data types
  • Threaded processing for computationally intensive processing
  • Methods and packages for easy deployment of reusable code modules
  • Embedded PROC FedSQL – DATA step power and SQL simplicity all in the same program
  • Code can be run inside massively parallel databases like Teradata, Hadoop and Greenplum

Additional Precision Gained from Using New Data Types

While Base SAS allows only character and numeric data types, DS2 offers numerous others: BIGINT, BINARY, CHAR, DATE, DECIMAL, DOUBLE, FLOAT, INTEGER, NCHAR, NVARCHAR, REAL, SMALLINT, TIME, TIMESTAMP, TINYINT, VARBINARY, and VARCHAR. You can read more about each type on the “What Are the Data Types?” web page.

If you’re using DS2 to write to a SAS data set, you will still be limited to storing the standard character and numeric data types, even if you’ve processed other DS2 data types in the code. If your data is stored in one of the supported data sources listed on this web page, you can read from and write to them while using data types native to those databases. For example, DS2 can provide the extra precision you need to get accurate, precise answers by using the DECIMAL data type instead of a standard SAS 8-byte numeric variable.

Threaded Processing for Computationally Intensive Processing

Unless thread programs are created and run inside DS2, the program runs sequentially. When your SAS program is computationally intense, it might be advantageous to create a thread program so the data is spread out over multiple processes to become more efficient. Enclose the DS2 code between THREAD and ENDTHREAD statements, use the DECLARE THREAD statement to create an instance of the thread in the DS2 program, and execute the thread with a SET FROM statement. If one computation thread can keep up with the I/O thread, then that single thread is used for all processing. It’s also important to remember that a single reader feeds all threads. If you’d like to know how the observations were spread out over the threads, you can use the _THREADID_ variable. Here’s an example of a thread program, including the _THREADID_ variable:

data test; 
do x=1 to 12316;
 j=x*x; 
output; 
end; 
run; 
 
proc ds2; 
  thread newton/overwrite=yes; 
    dcl double y count; 
    dcl bigint thisThread;
    drop count; 
    method run(); 
      set test; 
      y=(x*j*x)/(j+x); 
      thisThread=_threadid_; 
      count+1;
    end; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
    end;
  endthread; 
  run; 
quit; 
 
proc ds2; 
  data _null_; 
    dcl thread newton frac; 
    method run(); 
         set from frac threads=3; 
    end; 
  enddata;
  run; 
quit;

This is the output in the SAS log that shows the number of rows processed by each thread.

**Thread 1 processed 4124 rows
**Thread 0 processed 4096 rows
**Thread 2 processed 4096 rows

Methods and Packages for Easy Deployment of Reusable Code Modules

In DS2, all code must reside within a method. The three system methods are INIT, RUN, and TERM, and at least one must be specified in each DS2 program. You can write custom methods that can be stored for future use and that alleviate the need to recreate the code each time. Packages are collections of methods and variables. Although you can create custom packages, DS2 provides some packages that are predefined with methods to aid in your programming needs. Predefined packages include FCMP, Hash, HTTP, JSON, Logger, Matrix, SQLSTMT, and TZ. You can read more about them on the “Predefined DS2 Packages” web page.

Embedded PROC FedSQL – DATA Step Power and SQL Simplicity All in the Same Program

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard, and it allows vendor-neutral access to many data sources. This functionality can be used in the new procedure, PROC FedSQL, but you can also embed FedSQL queries in the SET statement within DS2. When a query is used in the SET statement, the resulting rows become the input to the remaining DS2 code. For example,

  set {select a.id,a.income,b.expenses 
         from lib.dset as a left join lib2.dset2 as b
         on a.id=b.id
         where a.income >40000
         order by a.id  };

The embedded FedSQL code gets converted to the native database code as needed.

Code Can Be Run Inside Massively Parallel Databases Like Teradata, Hadoop, and Greenplum

The SAS® In-Database Code Accelerator enables you to publish a DS2 thread program to the database and execute that thread program in parallel inside the database. DS2 thread programs can run inside Hadoop, Greenplum, and Teradata, while DS2 data programs can run inside Hadoop and Teradata. This has been a favorite aspect of DS2 for many customers thus far. To take advantage of this feature, you must license the SAS In-Database Code Accelerator, and the SAS/ACCESS® Interface to your database (Hadoop, Greenplum, or Teradata) in addition to Base SAS. The SAS® Embedded Process must be installed and configured on your database. Read more about the SAS In-Database Code Accelerator on the “Using the SAS In-Database Code Accelerator” web page.

While DS2 isn’t meant to replace the DATA step, there are numerous reasons why a customer might choose to use this new functionality. If you’re unsure if you should give it a try, or if you’d like to read more about it, here’s the link to the SAS® 9.4 DS2 Language Reference.

As always, we in Technical Support are here to answer your questions and offer tips and suggestions as you use our software. Thanks for using SAS!

Share

About Author

Kim Wilson

SAS Technical Support Analyst

Kim Wilson is a Technical Support Analyst in the Foundation SAS group in Technical Support. She has been a SAS user since 1996, and provides general support for the DATA step, macro, and Base procedures. She has written several papers and presented them at various SAS conferences and user events.

9 Comments

  1. Hi, Kim. This article is very helpful. Could you please explain how DS2 and SAS/BASE can be used in SAS-IMSTAT? How can do programming in IMSTAT and how these DS2 and SAS/BASE work together with IMSTAT? Could you please explain this puzzle? Thank you so much for your time.

  2. Nice overview of the "good stuff" baked in to DS2, Kim! I'm particularly fond of DS2 packages which enable object-oriented design for DS2 programs. And who wouldn't love the lightning-fast execution resulting from the ability to execute the DS2 program in-database, right where the data "lives", instead of the traditional process of bring the data out to the SAS platform? Thanks for putting this together.
    Mark

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top