CAS data modeling for performance

By Stephen Foerster on SAS Users August 14, 2017 Topics | Analytics Data Management

CAS data modeling The Cloud Analytic Server (CAS for short) is SAS’ latest high-performance, scalable, in-memory analytic data server. In this post, I’d like to discuss the CAS physical data model, i.e.what features CAS offers for data storage, and how to use them to maximize performance in CAS (and consequently SAS Visual Analytics 8.1 too).

So, specifically let’s answer the question:

What CAS physical table storage features can we use to get better performance in CAS and SAS Visual Analytics/CAS?

CAS Physical Table Storage Features

The following data storage features affect how CAS tables are physically structured:

Compression
Partitioning
Sorting
Repeated Tables
Extended Data Types (Varchar)
User Defined Formats

Compression — the Storage Option that Degrades Performance

Table compression tends to decrease CAS performance in general and SAS Visual Analytics performance specifically. While tables with extremely long character fields might see performance gains from compression, this is generally not the case. CAS uses the same compression algorithm as LASR and you should expect similar performance degradation.

Compression can be implemented via CAS action as well as a data set option. Using the data set option is demonstrated below:

data public.MegaCorp (compress=yes);
   set baselib.MegaCorp;
run;

Partitioning and Sorting

Partitioning is a powerful tool for improving CAS BY-Group performance.

Grouping and Ordering

Along with BY-Group processing in DATA Step, virtually all SAS Visual Analytics objects perform some sort of BY-Group analysis — SAS Visual Analytics graph objects (e.g. Bar Charts, Line Graphs) intrinsically group by categorical variables while SAS Visual Statistics objects (e.g. Decision Tree, Linear Regression) provide grouping as well as classification functionality.

When performing analyses/processing, CAS first groups the data into the required BY-groups. Pre-partitioning on commonly-used BY-groups means CAS can skip this step, vastly improving performance.

Within partitions, tables can be sorted by non-partition-key variables. Pre-sorting by natural ordering variables (e.g. time) allows CAS to skip the ordering step in many cases just like partitioning allows CAS to skip the grouping step.

For a full use-case, consider a line graph that groups sales by region and plots by date. This graph object would benefit greatly from a CAS table that is pre-partitioned by region and pre-sorted by date.

Join Optimization

Partitioning can also support join operations since both the CAS DATA Step Merge operation as well as the CAS FedSQL Merge Join algorithm utilize BY-GROUP operations to support their processing.

Pre-partitioning tables in anticipation of joins will greatly improve join performance. A good use case is partitioning both a large transaction table and an equally large reference table (e.g. an enormous Customer table) by the common field, customerID. When a DATA Step MERGE or a FedSQL join is performed between the two tables on that field, the join/merge will take advantage of partitioning for the BY-GROUP operation resulting in something similar to a partition-wise join.

Like Compression, partitioning and sorting can be implemented via CAS actions as well as data set options. Using the data set options is demonstrated below:

data mycas.bigOrderTable (partition=(region division) orderby=(year quarter month));
   set CASorBase.bigOrderTable;
run;

Repeated Tables

By default, in distributed CAS Server deployments, CAS divides incoming tables into blocks and distributes those blocks among its worker nodes. This approach works well in support of analytics on large single tables. The worker nodes collectively share the processing burden, working on their own piece of the table.

However, this scheme can lead to performance issues in certain circumstances. Repeated tables offer an alternative distribution method. With the DUPLICATE data set option or the REPEAT CASUTIL LOAD option set, CAS tables are copied in full to each worker node, instead of being divided.

Repeated Tables have two main use-cases in CAS:

1. Join Optimization
2. Small Table Operation Optimization

Join Optimization

For join operations, the default data distribution scheme can result in significant network traffic as matching records from the two tables travel between worker nodes to meet. If one of the two tables was created with the DUPLICATE/REPEAT option, then every possible record from that table is available on every node to the other table. There is no need for any network traffic.

Small Table Operation Optimization

For small tables, even single table operations can perform better with repeated instead of divided distribution. LASR actually implemented the “High Volume Access to Smaller Tables” feature for the same reason. When a table is repeated, CAS runs any required operation on a single worker node against the full copy of the table that resides there, instead of distributing the work.

As stated, repeated tables can be implemented with the DUPLICATE data set option, it can also be implemented with the REPEAT option on the PROC CASUTIL LOAD statement. The CASUTIL method is shown below:

proc casutil ;
   load data=sashelp.prdsale outcaslib=”caspath”
           casout=”prdsale” replace REPEAT ;
quit ;

Extended Data Types (VARCHAR)

With Viya 3.2 comes SAS’ first widespread implementation of variable length character fields. While Base SAS offers variable length character fields through compression, Viya 3.2 is the first major SAS release to include a VARCHAR data type.

While VARCHAR’s main function is to save storage space, it also improves performance by reducing the size of the record being processed. CAS, like any other processing engine, will process narrower records more quickly than wide records.

User Defined Formats

User defined formats (UDFs) exist in CAS in much the same way they do in Base SAS. Their primary function, of course, is to provide display formatting for raw data values. Think about a format for direction. The raw data might be: “E”, “W”, “N”, “S” while the corresponding format values might be “East”, “West”, “North”, “South.”

So how might user defined formats improve performance in CAS? The same way they do in Base SAS, and the same way that VARCHAR does, by reducing the size of the record that CAS has to process. Imagine replacing multiple 200 byte description fields with 1 byte codes. If you had 10 such fields, the record length would decrease 1990 bytes ((10 X 200) – 10). This is an extreme example but it illustrates the point: User defined formats can reduce the amount of data that CAS has to process and, consequently, will lead to performance gains.

About Author

Stephen Foerster
Senior Manager, SAS Professional Services

Stephen Foerster is a Senior Manager in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. Stephen has almost 20 years experience designing and building data warehouses and decision support systems using SAS and other technologies. Stephen now uses that experience creating enablement material for SAS technologists worldwide.

12 Comments

Roy on April 16, 2018 5:35 am

Hi Stephen,
Thanks for sharing. Would it be helpful to increase the performance for Visual Analytics on Viya?
Thanks!
Roy
- Stephen Foerster on April 16, 2018 9:10 am
  
  Yes, absolutely. VA submits CAS actions behind the scenes so it will take advantage of any efficiencies implemented on the CAS tables it uses.
  - Roy on April 16, 2018 10:52 pm
    
    Thanks for the reply. Here is other question. How do I decide which partition key and order key should use for my VA report? Is there any rule to follow?
    Thanks!
    - Stephen Foerster on April 17, 2018 10:52 am
      
      Roy, the idea is to minimize the work CAS has to do at query time (CAS action time). If a VA bar chart is grouping by region and the underlying data is already grouped (partitioned) by region then CAS doesn't have to group it. So the CAS action comes back quicker. Similarly, if a VA line chart is plotting data by date and the underlying data is already ordered by date then CAS doesn't have to sort it. So again the CAS action comes back quicker.
      
      So partition by the most common grouping variables and order by the most common order-by variables.
      - Roy on April 18, 2018 3:08 am
        
        Thanks for your reply again! Since We are working on improve our customer's VA report performance on Viya and we have not found the way yet. Our customer's VA reports are full of filter object and list table,which means they didn't use the chart you mentioned before. Not sure if it wold be helpful for the VA report performance improvement, so would it be also helpful for filter object on VA report with the partition key?
        Thanks!
        
        Stephen Foerster on April 18, 2018 1:44 pm
        
        For filters, you want to use an index, not a partition key. Indexes were not available when I published this post but they are now. Index any field that is used in a filter. The relevant information is in this post:
        
        https://communities.sas.com/t5/SAS-Communities-Library/CAS-Indexes-First-Look-Viya-3-3/ta-p/447533
  - Roy on April 17, 2018 5:54 am
    
    Another question is that how can i use user defined format on Visual Analytics 8.x on VIYA?
    Thanks!
    - Stephen Foerster on April 17, 2018 11:08 am
      
      First you'll need to get the formats registered in CAS. Information on that is here, http://go.documentation.sas.com/?cdcId=calcdc&cdcVersion=3.3&docsetId=caldatamgmtcas&docsetTarget=n0jtz261h5qtg7n1j6ql02gd65lo.htm&locale=en.
      
      You can apply registered formats to tables at load time. If loading with the CASUTIL LOAD statement, you would use the VARS option, http://go.documentation.sas.com/?docsetId=casref&docsetTarget=n03spmi9ixzq5pn11lneipfwyu8b.htm&docsetVersion=3.3&locale=en#p1gvn79hc0o7izn1egadg659q6da.
      
      You can also apply formats to already-loaded CAS tables using the ALTERTABLE CAS action, http://go.documentation.sas.com/?docsetId=caspg&docsetTarget=cas-table-altertable.htm&docsetVersion=3.3&locale=en.
      - Roy on April 18, 2018 2:50 am
        
        Hi Stephen, Thanks for the reply. I have successfully created my user defined format on CAS, but I can't use it on my VA report after formatted my table on CAS. Here is the error when I import the CAS table on VA report. Not sure why I can't use it on my VA report.
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        com.sas.cas.CASException: The user-defined programming statements could not be parsed. (severity=2 reason=6 statusCode=2710055)
        4 WARNING: License for feature 'TKCAS SAS Cloud Analytic Services Server' has expired and will stop working in 11 days. Contact your installation representative to obtain a renewal.
        4 WARNING: License for feature 'SAGEMID SAS Visual Analytics Explorer' has expired and will stop working in 11 days. Contact your installation representative to obtain a renewal.
        3 NOTE: Executing action 'simple.distinct'.
        5 ERROR: The user-defined programming statements could not be parsed.
        5 ERROR: Invalid FORMAT '$BRANCH_ID_F.' found.
        5 ERROR: Failure opening table 'B130082_OP_ORDER'
        5 ERROR: The action stopped due to errors.
        3 NOTE: Action 'simple.distinct' used (Total process time):
        3 NOTE: real time 0.000073 seconds
        3 NOTE: cpu time 0.086105 seconds (117952.05%)
        3 NOTE: total nodes 1 (8 cores)
        3 NOTE: total memory 117.88G
        3 NOTE: memory 0.00K (0.00%)
        debug=0x887ff837:TKCASTAB_PARSE_ERROR:
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Here is the sas code i used on SAS studio
        
        LIBNAME T cvp "/OA_share_folder";
        
        CAS mySession SESSOPTS=(messagelevel=all CASLIB=public TIMEOUT=999 LOCALE="en_US");
        CASLIB _ALL_ ASSIGN;
        
        proc format library=work.formats sessRef=mySession casFmtLib="SASFORMATS2";
        value $Branch_id_f
        "1"="A"
        "2"="B"
        "3"="C"
        "4"="D"
        "5"="E"
        "6"="F"
        ;
        
        run;
        
        cas mySession promoteFmtLib fmtLibName=SASFORMATS2 replace ;
        
        cas mySession saveFmtLib fmtLibName=SASFORMATS2 casLib=Formats
        table=SASFORMATS2 replace;
        
        data test ;
        set t.B130082_OP_ORDER;
        format Branch_id $Branch_id_f. func_type $func_type_f. Hosp_cnt_type $Hosp_cnt_type_f. Hosp_data_type $Hosp_data_type_f.;
        run;
        
        proc casutil;
        load data=test outcaslib="public" casout="B130082_OP_ORDER_stg"
        replace;
        quit;
        
        PROC CASUTIL;
        droptable casdata="B130082_OP_ORDER" incaslib="public";
        promote casdata="B130082_OP_ORDER_stg" casout="B130082_OP_ORDER" incaslib="public"
        outcaslib="public" ;
        quit;
        
        Stephen Foerster on April 18, 2018 11:42 am
        
        Nice work applying the format. Sorry for the bump in the road. Let's see if the format works in CAS but outside of VA. Can you try running the following in SAS Studio? (You can run all of these CAS actions together but I separated them to help isolate the problem if something fails)
        
        proc cas;
        table.columninfo / table="B130082_OP_ORDER" ;
        run ; quit ;
        
        proc cas;
        table.columninfo / table="B130082_OP_ORDER_stg" ;
        run ; quit ;
        
        proc cas; table.fetch /
        table={name='B130082_OP_ORDER',caslib='public'}
        /*fetchvars={{name='xxx',format='xxx'}}*/
        format=TRUE
        quit;
        
        proc cas; table.fetch /
        table={name='B130082_OP_ORDER_stg',caslib='public'}
        /*fetchvars={{name='xxx',format='xxx'}}*/
        format=TRUE
        quit;
        
        Roy on April 20, 2018 5:22 am
        
        HI Stephen, here are the results.
        
        proc cas;
        table.columninfo / table="B130082_OP_ORDER" ;
        run ; quit ;
        
        Results from table.columnInfo
        Column Information for B130082_OP_ORDER in Caslib Public
        Column Label Id Type Length Formatted
        Length Format Format
        Width Format
        Decimal
        BRANCH_ID 1 char 1 8 $F 0 0
        HOSP_DATA_TYPE 2 char 3 8 $F 0 0
        HOSP_CNT_TYPE 3 char 1 8 $F 0 0
        HOSP_ID_PLAIN 4 char 15 15 0 0
        HOSP_ABBR 5 char 30 30 0 0
        FEE_YM FEE_YM 6 double 8 11 F 11 0
        YR 7 double 8 12 0 0
        FUNC_TYPE 8 char 22 15 $F 0 0
        ORDER_CODE 9 char 22 22 0 0
        TYPE_C 10 char 21 21 0 0
        ORDER_QTY ORDER_QTY 11 double 8 27 F 27 2
        ORDER_DOT ORDER_DOT 12 double 8 26 F 26 0
        ORDER_QTY1 ORDER_QTY 13 double 8 27 F 27 2
        ORDER_DOT1 ORDER_DOT 14 double 8 26 F 26 0
        CURE_CNAME 15 char 60 60 0 0
        ATC5 16 char 27 27 0 0
        -----------------------------------------------------------------------------------------------------------------------------------------------------
        proc cas;
        table.columninfo / table="B130082_OP_ORDER_stg" ;
        run ; quit ;
        Results from table.columnInfo
        Column Information for B130082_OP_ORDER_STG in Caslib Public
        Column Label Id Type Length Formatted
        Length Format Format
        Width Format
        Decimal
        BRANCH_ID 1 char 1 8 $F 0 0
        HOSP_DATA_TYPE 2 char 3 8 $F 0 0
        HOSP_CNT_TYPE 3 char 1 8 $F 0 0
        HOSP_ID_PLAIN 4 char 15 15 0 0
        HOSP_ABBR 5 char 30 30 0 0
        FEE_YM FEE_YM 6 double 8 11 F 11 0
        YR 7 double 8 12 0 0
        FUNC_TYPE 8 char 22 15 $F 0 0
        ORDER_CODE 9 char 22 22 0 0
        TYPE_C 10 char 21 21 0 0
        ORDER_QTY ORDER_QTY 11 double 8 27 F 27 2
        ORDER_DOT ORDER_DOT 12 double 8 26 F 26 0
        ORDER_QTY1 ORDER_QTY 13 double 8 27 F 27 2
        ORDER_DOT1 ORDER_DOT 14 double 8 26 F 26 0
        CURE_CNAME 15 char 60 60 0 0
        ATC5 16 char 27 27 0 0
        ------------------------------------------------------------------------------------------------------------------------------------------------------------------
        
        proc cas; table.fetch /
        table={name='B130082_OP_ORDER',caslib='public'}
        fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
        format=TRUE;
        quit;
        
        1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
        72
        73 proc cas; table.fetch /
        74 table={name='B130082_OP_ORDER',caslib='public'}
        75 fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='
        75 ! Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
        76 format=TRUE;
        NOTE: The submitted statements have been canceled.
        77 quit;
        
        NOTE: PROCEDURE CAS used (Total process time):
        real time 0.00 seconds
        cpu time 0.00 seconds
        
        proc cas; table.fetch /
        table={name='B130082_OP_ORDER_stg',caslib='public'}
        fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
        format=TRUE;
        quit;
        1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
        72
        73 proc cas; table.fetch /
        74 table={name='B130082_OP_ORDER_stg',caslib='public'}
        75 fetchvars={{name='branch_id',format='branch_id_f'},{name='func_type',format='func_type_f'},{name='Hosp_cnt_type',format='
        75 ! Hosp_cnt_type_f'},{name='Hosp_data_type',format='Hosp_data_type_f'}};
        76 format=TRUE;
        NOTE: The submitted statements have been canceled.
        77 quit;
        
        NOTE: PROCEDURE CAS used (Total process time):
        real time 0.00 seconds
        cpu time 0.01 seconds
        
        Roy on April 27, 2018 5:08 am
        
        Hi Stephen,
        Did I using the wrong code to apply the format on VA designer?
        Thanks!

Blogs