CAS data modeling for performance

12

CAS data modelingThe Cloud Analytic Server (CAS for short) is SAS’ latest high-performance, scalable, in-memory analytic data server.  In this post, I’d like to discuss the CAS physical data model, i.e.what features CAS offers for data storage, and how to use them to maximize performance in CAS (and consequently SAS Visual Analytics 8.1 too).

So, specifically let’s answer the question:

What CAS physical table storage features can we use to get better performance in CAS and SAS Visual Analytics/CAS?

CAS Physical Table Storage Features

The following data storage features affect how CAS tables are physically structured:

  • Compression
  • Partitioning
  • Sorting
  • Repeated Tables
  • Extended Data Types (Varchar)
  • User Defined Formats

Compression — the Storage Option that Degrades Performance

Table compression tends to decrease CAS performance in general and SAS Visual Analytics performance specifically.  While tables with extremely long character fields might see performance gains from compression, this is generally not the case. CAS uses the same compression algorithm as LASR and you should expect similar performance degradation.

Compression can be implemented via CAS action as well as a data set option. Using the data set option is demonstrated below:

data public.MegaCorp (compress=yes);
   set baselib.MegaCorp;
run;

Partitioning and Sorting

Partitioning is a powerful tool for improving CAS BY-Group performance.

Grouping and Ordering

Along with BY-Group processing in DATA Step, virtually all SAS Visual Analytics objects perform some sort of BY-Group analysis — SAS Visual Analytics graph objects (e.g. Bar Charts, Line Graphs) intrinsically group by categorical variables while SAS Visual Statistics objects (e.g. Decision Tree, Linear Regression) provide grouping as well as classification functionality.

When performing analyses/processing, CAS first groups the data into the required BY-groups. Pre-partitioning on commonly-used BY-groups means CAS can skip this step, vastly improving performance.

Within partitions, tables can be sorted by non-partition-key variables. Pre-sorting by natural ordering variables (e.g. time) allows CAS to skip the ordering step in many cases just like partitioning allows CAS to skip the grouping step.

For a full use-case, consider a line graph that groups sales by region and plots by date. This graph object would benefit greatly from a CAS table that is pre-partitioned by region and pre-sorted by date.

Join Optimization

Partitioning can also support join operations since both the CAS DATA Step Merge operation as well as the CAS FedSQL Merge Join algorithm utilize BY-GROUP operations to support their processing.

Pre-partitioning tables in anticipation of joins will greatly improve join performance. A good use case is partitioning both a large transaction table and an equally large reference table (e.g. an enormous Customer table) by the common field, customerID. When a DATA Step MERGE or a FedSQL join is performed between the two tables on that field, the join/merge will take advantage of partitioning for the BY-GROUP operation resulting in something similar to a partition-wise join.

Like Compression, partitioning and sorting can be implemented via CAS actions as well as data set options. Using the data set options is demonstrated below:

data mycas.bigOrderTable (partition=(region division) orderby=(year quarter month));
   set CASorBase.bigOrderTable;
run;

Repeated Tables

By default, in distributed CAS Server deployments, CAS divides incoming tables into blocks and distributes those blocks among its worker nodes. This approach works well in support of analytics on large single tables. The worker nodes collectively share the processing burden, working on their own piece of the table.

However, this scheme can lead to performance issues in certain circumstances. Repeated tables offer an alternative distribution method. With the DUPLICATE data set option or the REPEAT CASUTIL LOAD option set, CAS tables are copied in full to each worker node, instead of being divided.

Repeated Tables have two main use-cases in CAS:

1.     Join Optimization
2.     Small Table Operation Optimization

Join Optimization

For join operations, the default data distribution scheme can result in significant network traffic as matching records from the two tables travel between worker nodes to meet. If one of the two tables was created with the DUPLICATE/REPEAT option, then every possible record from that table is available on every node to the other table. There is no need for any network traffic.

Small Table Operation Optimization

For small tables, even single table operations can perform better with repeated instead of divided distribution. LASR actually implemented the “High Volume Access to Smaller Tables” feature for the same reason. When a table is repeated, CAS runs any required operation on a single worker node against the full copy of the table that resides there, instead of distributing the work.

As stated, repeated tables can be implemented with the DUPLICATE data set option, it can also be implemented with the REPEAT option on the PROC CASUTIL LOAD statement. The CASUTIL method is shown below:

proc casutil ;
   load data=sashelp.prdsale outcaslib=”caspath”
           casout=”prdsale” replace REPEAT ;
quit ;

Extended Data Types (VARCHAR)

With Viya 3.2 comes SAS’ first widespread implementation of variable length character fields. While Base SAS offers variable length character fields through compression, Viya 3.2 is the first major SAS release to include a VARCHAR data type.

While VARCHAR’s main function is to save storage space, it also improves performance by reducing the size of the record being processed. CAS, like any other processing engine, will process narrower records more quickly than wide records.

User Defined Formats

User defined formats (UDFs) exist in CAS in much the same way they do in Base SAS. Their primary function, of course, is to provide display formatting for raw data values. Think about a format for direction. The raw data might be: “E”, “W”, “N”, “S” while the corresponding format values might be “East”, “West”, “North”, “South.”

So how might user defined formats improve performance in CAS? The same way they do in Base SAS, and the same way that VARCHAR does, by reducing the size of the record that CAS has to process. Imagine replacing multiple 200 byte description fields with 1 byte codes. If you had 10 such fields, the record length would decrease 1990 bytes ((10 X 200) – 10). This is an extreme example but it illustrates the point: User defined formats can reduce the amount of data that CAS has to process and, consequently, will lead to performance gains.

Share

About Author

Stephen Foerster

Senior Manager, SAS Professional Services

Stephen Foerster is a Senior Manager in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. Stephen has almost 20 years experience designing and building data warehouses and decision support systems using SAS and other technologies. Stephen now uses that experience creating enablement material for SAS technologists worldwide.

12 Comments

  1. Hi Stephen,
    Thanks for sharing. Would it be helpful to increase the performance for Visual Analytics on Viya?
    Thanks!
    Roy

    • Stephen Foerster
      Stephen Foerster on

      Yes, absolutely. VA submits CAS actions behind the scenes so it will take advantage of any efficiencies implemented on the CAS tables it uses.

      • Thanks for the reply. Here is other question. How do I decide which partition key and order key should use for my VA report? Is there any rule to follow?
        Thanks!

        • Stephen Foerster
          Stephen Foerster on

          Roy, the idea is to minimize the work CAS has to do at query time (CAS action time). If a VA bar chart is grouping by region and the underlying data is already grouped (partitioned) by region then CAS doesn't have to group it. So the CAS action comes back quicker. Similarly, if a VA line chart is plotting data by date and the underlying data is already ordered by date then CAS doesn't have to sort it. So again the CAS action comes back quicker.

          So partition by the most common grouping variables and order by the most common order-by variables.

          • Thanks for your reply again! Since We are working on improve our customer's VA report performance on Viya and we have not found the way yet. Our customer's VA reports are full of filter object and list table,which means they didn't use the chart you mentioned before. Not sure if it wold be helpful for the VA report performance improvement, so would it be also helpful for filter object on VA report with the partition key?
            Thanks!

        • Stephen Foerster
          Stephen Foerster on

          First you'll need to get the formats registered in CAS. Information on that is here, http://go.documentation.sas.com/?cdcId=calcdc&cdcVersion=3.3&docsetId=caldatamgmtcas&docsetTarget=n0jtz261h5qtg7n1j6ql02gd65lo.htm&locale=en.

          You can apply registered formats to tables at load time. If loading with the CASUTIL LOAD statement, you would use the VARS option, http://go.documentation.sas.com/?docsetId=casref&docsetTarget=n03spmi9ixzq5pn11lneipfwyu8b.htm&docsetVersion=3.3&locale=en#p1gvn79hc0o7izn1egadg659q6da.

          You can also apply formats to already-loaded CAS tables using the ALTERTABLE CAS action, http://go.documentation.sas.com/?docsetId=caspg&docsetTarget=cas-table-altertable.htm&docsetVersion=3.3&locale=en.

          • Hi Stephen, Thanks for the reply. I have successfully created my user defined format on CAS, but I can't use it on my VA report after formatted my table on CAS. Here is the error when I import the CAS table on VA report. Not sure why I can't use it on my VA report.
            --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            com.sas.cas.CASException: The user-defined programming statements could not be parsed. (severity=2 reason=6 statusCode=2710055)
            4 WARNING: License for feature 'TKCAS SAS Cloud Analytic Services Server' has expired and will stop working in 11 days. Contact your installation representative to obtain a renewal.
            4 WARNING: License for feature 'SAGEMID SAS Visual Analytics Explorer' has expired and will stop working in 11 days. Contact your installation representative to obtain a renewal.
            3 NOTE: Executing action 'simple.distinct'.
            5 ERROR: The user-defined programming statements could not be parsed.
            5 ERROR: Invalid FORMAT '$BRANCH_ID_F.' found.
            5 ERROR: Failure opening table 'B130082_OP_ORDER'
            5 ERROR: The action stopped due to errors.
            3 NOTE: Action 'simple.distinct' used (Total process time):
            3 NOTE: real time 0.000073 seconds
            3 NOTE: cpu time 0.086105 seconds (117952.05%)
            3 NOTE: total nodes 1 (8 cores)
            3 NOTE: total memory 117.88G
            3 NOTE: memory 0.00K (0.00%)
            debug=0x887ff837:TKCASTAB_PARSE_ERROR:
            --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Here is the sas code i used on SAS studio

            LIBNAME T cvp "/OA_share_folder";

            CAS mySession SESSOPTS=(messagelevel=all CASLIB=public TIMEOUT=999 LOCALE="en_US");
            CASLIB _ALL_ ASSIGN;

            proc format library=work.formats sessRef=mySession casFmtLib="SASFORMATS2";
            value $Branch_id_f
            "1"="A"
            "2"="B"
            "3"="C"
            "4"="D"
            "5"="E"
            "6"="F"
            ;

            run;

            cas mySession promoteFmtLib fmtLibName=SASFORMATS2 replace ;

            cas mySession saveFmtLib fmtLibName=SASFORMATS2 casLib=Formats
            table=SASFORMATS2 replace;

            data test ;
            set t.B130082_OP_ORDER;
            format Branch_id $Branch_id_f. func_type $func_type_f. Hosp_cnt_type $Hosp_cnt_type_f. Hosp_data_type $Hosp_data_type_f.;
            run;

            proc casutil;
            load data=test outcaslib="public" casout="B130082_OP_ORDER_stg"
            replace;
            quit;

            PROC CASUTIL;
            droptable casdata="B130082_OP_ORDER" incaslib="public";
            promote casdata="B130082_OP_ORDER_stg" casout="B130082_OP_ORDER" incaslib="public"
            outcaslib="public" ;
            quit;

            • Stephen Foerster
              Stephen Foerster on

              Nice work applying the format. Sorry for the bump in the road. Let's see if the format works in CAS but outside of VA. Can you try running the following in SAS Studio? (You can run all of these CAS actions together but I separated them to help isolate the problem if something fails)

              proc cas;
              table.columninfo / table="B130082_OP_ORDER" ;
              run ; quit ;

              proc cas;
              table.columninfo / table="B130082_OP_ORDER_stg" ;
              run ; quit ;

              proc cas; table.fetch /
              table={name='B130082_OP_ORDER',caslib='public'}
              /*fetchvars={{name='xxx',format='xxx'}}*/
              format=TRUE
              quit;

              proc cas; table.fetch /
              table={name='B130082_OP_ORDER_stg',caslib='public'}
              /*fetchvars={{name='xxx',format='xxx'}}*/
              format=TRUE
              quit;

              • HI Stephen, here are the results.

                proc cas;
                table.columninfo / table="B130082_OP_ORDER" ;
                run ; quit ;

                Results from table.columnInfo
                Column Information for B130082_OP_ORDER in Caslib Public
                Column Label Id Type Length Formatted
                Length Format Format
                Width Format
                Decimal
                BRANCH_ID 1 char 1 8 $F 0 0
                HOSP_DATA_TYPE 2 char 3 8 $F 0 0
                HOSP_CNT_TYPE 3 char 1 8 $F 0 0
                HOSP_ID_PLAIN 4 char 15 15 0 0
                HOSP_ABBR 5 char 30 30 0 0
                FEE_YM FEE_YM 6 double 8 11 F 11 0
                YR 7 double 8 12 0 0
                FUNC_TYPE 8 char 22 15 $F 0 0
                ORDER_CODE 9 char 22 22 0 0
                TYPE_C 10 char 21 21 0 0
                ORDER_QTY ORDER_QTY 11 double 8 27 F 27 2
                ORDER_DOT ORDER_DOT 12 double 8 26 F 26 0
                ORDER_QTY1 ORDER_QTY 13 double 8 27 F 27 2
                ORDER_DOT1 ORDER_DOT 14 double 8 26 F 26 0
                CURE_CNAME 15 char 60 60 0 0
                ATC5 16 char 27 27 0 0
                -----------------------------------------------------------------------------------------------------------------------------------------------------
                proc cas;
                table.columninfo / table="B130082_OP_ORDER_stg" ;
                run ; quit ;
                Results from table.columnInfo
                Column Information for B130082_OP_ORDER_STG in Caslib Public
                Column Label Id Type Length Formatted
                Length Format Format
                Width Format
                Decimal
                BRANCH_ID 1 char 1 8 $F 0 0
                HOSP_DATA_TYPE 2 char 3 8 $F 0 0
                HOSP_CNT_TYPE 3 char 1 8 $F 0 0
                HOSP_ID_PLAIN 4 char 15 15 0 0
                HOSP_ABBR 5 char 30 30 0 0
                FEE_YM FEE_YM 6 double 8 11 F 11 0
                YR 7 double 8 12 0 0
                FUNC_TYPE 8 char 22 15 $F 0 0
                ORDER_CODE 9 char 22 22 0 0
                TYPE_C 10 char 21 21 0 0
                ORDER_QTY ORDER_QTY 11 double 8 27 F 27 2
                ORDER_DOT ORDER_DOT 12 double 8 26 F 26 0
                ORDER_QTY1 ORDER_QTY 13 double 8 27 F 27 2
                ORDER_DOT1 ORDER_DOT 14 double 8 26 F 26 0
                CURE_CNAME 15 char 60 60 0 0
                ATC5 16 char 27 27 0 0
                ------------------------------------------------------------------------------------------------------------------------------------------------------------------

                proc cas; table.fetch /
                table={name='B130082_OP_ORDER',caslib='public'}
                fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
                format=TRUE;
                quit;

                1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
                72
                73 proc cas; table.fetch /
                74 table={name='B130082_OP_ORDER',caslib='public'}
                75 fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='
                75 ! Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
                76 format=TRUE;
                NOTE: The submitted statements have been canceled.
                77 quit;

                NOTE: PROCEDURE CAS used (Total process time):
                real time 0.00 seconds
                cpu time 0.00 seconds

                proc cas; table.fetch /
                table={name='B130082_OP_ORDER_stg',caslib='public'}
                fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
                format=TRUE;
                quit;
                1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
                72
                73 proc cas; table.fetch /
                74 table={name='B130082_OP_ORDER_stg',caslib='public'}
                75 fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='
                75 ! Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
                76 format=TRUE;
                NOTE: The submitted statements have been canceled.
                77 quit;

                NOTE: PROCEDURE CAS used (Total process time):
                real time 0.00 seconds
                cpu time 0.01 seconds

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top