Partitioning in Hadoop, sorting in SAS--same results, different methods


SAS In-Memory Statistics for Hadoop is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. Its principal components are the IMSTAT procedure (PROC IMSTAT) and the SAS LASR Analytic Engine (or SASIOLA engine, for input/output with LASR).

Within the SAS In-Memory Statistics for Hadoop environment, the SAS LASR Analytic Engine provides most of the functionality we associate with Base SAS, whereas PROC IMSTAT covers the full analytical cycle from data manipulation and management, through modeling, to deployment. This duality continues SAS's long-standing tradition of getting the same job done in different ways, to accommodate users' different styles, constraints and preferences.

This post is one of several I plan to publish that map key analytical data exercises from traditional SAS programming to SAS In-Memory Statistics for Hadoop. Today's post covers sorting and the related topics of BY variables and ordering in SAS In-Memory Statistics for Hadoop.

How to sort without PROC SORT

Almost every modeler or analyst who has ever prepared data for modeling with SAS tools is familiar with the SORT procedure, the SET and BY statements, and FIRST. and LAST. processing.

In SAS In-Memory Statistics for Hadoop, PROC SORT is not supported as a syntax invocation. The act of sorting, however, is definitely supported: it is now accomplished through the PARTITION statement in PROC IMSTAT, or through the PARTITION= data set option when reading data with the SASIOLA engine.

Partitioning data with the SAS LASR Analytic Engine

The process, and its differences from traditional Base SAS programming, are best explained by walking through this code example:

libname dlq SASIOLA START TAG=data2;

data dlq.in(partition=(key2) orderby=(month) replace=yes);
   set trans.dlq_month1
       trans.dlq_month2;
run;
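For orientation, here is a rough traditional Base SAS analogue of the step above. This is a sketch for comparison only: it stacks the monthly tables and then sorts them sequentially on disk, which is precisely the single-threaded behavior the in-memory version avoids.

/* Traditional Base SAS analogue (a sketch for comparison only):
   stack the monthly tables, then sort on disk. */
data work.dlq_all;
   set trans.dlq_month1
       trans.dlq_month2;
run;

proc sort data=work.dlq_all;
   by key2 month;   /* key2 groups the rows; month orders within each group */
run;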

  • There are two important options for the SASIOLA engine. Specifying the SASIOLA engine on the LIBNAME statement continues the spirit of SAS/ACCESS engines, such as the SAS/ACCESS Interface to Oracle, or of older native engines like V8.

    • The START option on the LIBNAME statement is unique to SASIOLA. It tells SAS to launch the SASIOLA in-memory server; you can release the library or shut down the server later.
    • The TAG= option is critical and is required for several reasons. One reason is to reference the data properly once it is loaded into memory. A related reason is to avoid potential collisions when multiple actions are happening in the same memory space. Also, when loading data from a source such as Hadoop, where locations can be nested many layers deep, the two-level naming restriction of traditional Base SAS is no longer sufficient; the TAG= option allows for longer specifications.
  • The SET statement can still be used to stack data sets. There is no limit to how many data sets you can read with the SAS LASR Analytic Engine. One salient concern is sizing: how much memory is needed to accommodate the combined input data sets, a question that matters less when running the SET statement in traditional Base SAS code. Also noteworthy is the fact that multiple SET statements are not supported by the SASIOLA engine, although a single SET statement can read multiple input data sets. An interesting question is how much you still need multiple SET statements in this new in-memory computation context.
  • The PARTITION= data set option processes data in parallel. With the SASIOLA engine, sorting happens through the PARTITION= data set option. The option partition=(key2) is, logically, the same as proc sort; by key2; run;. However, this evolution amounts to more than a switch of syntax or name: it reflects a fundamental difference between analytical computing centered on Hadoop and traditional Base SAS.

    • Hadoop is parallel computing by default. If you are running the SASIOLA engine on a 32-node Hadoop environment, the partitioning step naturally spreads the partitions across the 32 nodes, instead of writing all observations into one sequential data set as PROC SORT does.
    • The PARTITION= option puts records belonging to the same partition on the same node (there are, indeed, padding and block-size considerations). Accessing the partitions later is designed to happen in parallel; some of us call it bursting through the memory pipes. This is very different from Base SAS, where processing grinds through observations one by one.
  • Grouping and ordering are different with the PARTITION= option. As we have learned from PROC SORT, the first variable listed is typically meant to group, not to order. If you list only one variable in PROC SORT, you usually intend only to group by that variable. For example, if the variable is an account number or a segment label, analytically speaking you rarely need to order by its values in addition to grouping by them. But PROC SORT, in most cases, orders the observations by the grouping variable anyway. That is not the case with partitioning in SASIOLA, or in SAS In-Memory Statistics for Hadoop in general.

    With the SASIOLA engine, as with PROC SORT, the following still apply:

    • You can list as many variables as you see fit in the PARTITION= option.
    • The order of the variables listed still matters.
    • The same sense and sensibility remains: the more variables you list, the less analytical sense the grouping usually makes, even when it is necessary.
  • You can use PARTITION= as a data set option for input data sets as well. My preference is to use it as summary processing on the output data set. There are cases where partitions rendered at the input are automatically or implicitly preserved in the output, and there are cases where the preservation does not happen.
  • ORDERBY= processing happens within the partitions. For example, suppose you run proc sort; by key2 month; run; and a key2 group contains multiple entries of month=JAN; using first.key2 later does not pin down a single record for you unless you add at least one more variable to the BY statement. The same holds with partition=(key2) orderby=(month) under SASIOLA. If, however, the later action is a summary by the two variables, then proc means; by key2 month; run; will yield different results from running the SUMMARY action in SAS In-Memory Statistics for Hadoop (PROC IMSTAT, to be specific), because PROC IMSTAT effectively uses only the partition variable key2 and ignores the ORDERBY variable month.
  • The REPLACE= data set option is a concise way to delete data. REPLACE=YES is a concise way to effect proc delete data=dlq.in; run; or proc datasets lib=dlq; delete in; run;. This option carries an obvious Hadoop flavor, where replacing a data set wholesale is more natural than updating it in place.
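Tying back to the START option discussed above: when the in-memory work is finished, you can release the libref. The one-line sketch below assumes the libref dlq from the earlier example; note that clearing a libref does not by itself shut down the LASR server (server shutdown syntax varies by release, so check your documentation).

/* Release the libref when done; the LASR server itself may keep running. */
libname dlq clear;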

Partitioning data with the IMSTAT procedure

This code example helps explain the differences between partitioning data with the SASIOLA engine and with PROC IMSTAT when sorting or grouping data. The end result is pretty much the same; the differences are in the "way of life."

proc imstat;
   table dlq.in;
   partition key2 / orderby=month;
run;
   table dlq.&_templast_;
   summary x1 / partition;
run;

  • The IMSTAT procedure is truly interactive programming. While both examples represent genuine in-memory computation, the SASIOLA engine method resembles a traditional Base SAS batch step.
  • Multiple RUN statements support data exploration. Within a single invocation of PROC IMSTAT, the analyst can use RUN statements to divide the whole stream into sub-scopes, so that slice-and-dice exploration (such as the SUMMARY action) and modeling (not shown here) happen while the data tables remain in memory.
  • Nothing is saved to disk. In both examples, none of the resulting data sets are saved to disk; they all remain in memory. There is a SAVE statement that allows the data to be written back to disk.
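As an illustration of the SAVE statement mentioned above, the sketch below writes the last temporary table out to disk. The output path is hypothetical and the exact SAVE options vary by release, so treat the details as assumptions to verify against your documentation.

proc imstat;
   table dlq.&_templast_;
   save path="/user/output";   /* hypothetical output location */
run;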

In upcoming posts, I will cover transpose and retain operations.

Thank you,
Jason


About Author

Jason Xin

Pre-sales Analytics Solution Architect

Jason Xin has been an analytics practitioner for more than 15 years, with concentrations in financial services and life sciences. He covers predictive analytics, statistics, forecasting, optimization, machine learning and data management, both exquisite and heavy-duty. Xin focuses his work on creating designs and finding paths to each customer’s unique analytics needs. He is also passionate about solving problems, building a wide knowledge base for customers and providing incremental value through online engagement.
