An overview of SAS Data Quality 3.3 programming capabilities

0

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The list of the Data Quality functions currently supported in CAS are listed here and below:

SAS Data Quality 3.3 programming capabilities

They cover casing, parsing, field extraction, gender analysis, identification analysis, match codes and standardize capabilities.

As for now, they are only available in the CAS Data Step. You can’t use them in DS2 or in FedSQL.

To run in CAS certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Others.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
 
libname casuser cas caslib="casuser" ;
 
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ;
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers on in-memory data. You can confirm this by looking for the following message in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his(her) name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Identify the name of the tokens produced by the parsing function, with the dqParseInfoGet function.
  • Write the token names in the log, the tokens for this definition are:
    Prefix,Given Name,Middle Name,Family Name,Suffix,Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers 2 algorithms, one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using RegEx expressions.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.

Example:

proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"
      table={caslib="casuser" name="product_dim"}
      columns={"ProductBrand","ProductLine","Product","ProductDescription","ProductQuality"}
      cutoff=20
      frequencies=10
      outliers=5
      casOut={caslib="casuser" name="product_dim_profiled" replace=true}
   ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.

The RowId column needs to be matched with the documentation to understand how to read the results. There are around 40 metrics at the column level plus 12 at the table level.

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear on the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022) and standard error (#1023) if a numeric column was profiled.
  • It can also contain frequency distributions, maximum, minimum, and mode if the source column is of DecSext data type which is not possible yet.
  • DecSext is a 192-bit fixed-decimal data type that is not supported yet in CAS, and consequently is converted into a double most of the time. Also, SAS Studio cannot render correctly new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded due to the use of integers in the Count field.
  • The legend for metric 1001 is not documented. Here it is:

1: CHAR
2: VARCHAR
3: DATE
4: DATETIME
5: DECQUAD
6: DECSEXT
7: DOUBLE
8: INT32
9: INT64
10: TIME

A last word on the profile CAS action. It can help you to perform some identity analysis using patterns defined as RegEx expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         {pattern="PAT=</span>?999[<span style=" />-]? ?999[- ]9999",type="USPHONE"}, 
         {pattern= "PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern= "PAT=^99999$",type="ZIP"}, 
         {pattern= "[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern= "^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"}
   ;
quit ;

In this example that comes from the documentation, identities are specified to identify data that looks like phone, zip, email, and state.

You can observe the results of the identity analysis in the metric #1028 as shown below:

I hope this post has been helpful.

Thanks for reading.

Share

About Author

Nicolas Robert

Principal Technical Architect

Nicolas Robert is a Principal Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. He has been at SAS since 1998, serving different roles in Technical Support, PSD and Pre-Sales. His primary focus is on Data Management.

Related Posts

Leave A Reply

Back to Top