Data Governance Series: Share DataFlux Data Quality profiling metrics in SAS Visual Analytics

Share DataFlux Data Quality profiling metrics in SAS Visual AnalyticsThis is my first blog post, and the first of a long series around Data Governance. The first thing I want to discuss is the ability to Share DataFlux Data Quality profiling metrics in SAS Visual Analytics.

This post will illustrate how to extract profiling data from the DataFlux repository and how to use SAS Visual Analytics to distribute profiling results centrally and widely through an organization.

To enable data profiling visualization in SAS Visual Analytics, you need to have SAS Visual Analytics, either the full version in addition to any solution containing DataFlux Data Management Studio or Server, or the VAAR (Visual Analytics Administration & Reporting) component that is included, for example, in SAS Data Governance, SAS Data Quality Advanced, SAS Data Management Advanced and SAS MDM Advanced.

Overall process

The process is a 2-step process:

  • Extract data from DataFlux repository and build a data quality profiling datamart
    Load the data quality profiling datamart into memory and plug VA reports on it

Here is the master DataFlux Process Job that orchestrates the 2 steps.

profiling processes results in SAS Visual Analytics

So, let’s start. The first thing to do is to export data profiling results from the DataFlux repository to a set of tables that will be the base of our reports. SAS Data Management Studio does not provide special nodes to query easily the repository for profiling results, as it does for monitoring results. In this case, we will use SQL queries to extract required data to build the necessary reports.

Repository Data Model

To go further, we need to know a bit more about the DataFlux Repository physical data model. There are (at least) two ways to describe it:

First way (preferred): in the Data Management Studio Options window (Tools -> Data Management Studio Options -> General -> Save Repository Schema…)

Data Quality profiling2

The resulting file is a very useful text file that describes all repository profile and monitor tables, and their relations.

Data Quality profiling3

Second way (more technical): in the Data Management Studio New Repository Definition window (Administration -> Repository Definitions -> New Repository Definition… -> Database Connection -> Save Repository DDL…)

Data Quality profiling4

NB: you can cancel the creation of the repository once you get your DDL file.

The resulting file is an SQL script enabling the creation of the repository by an admin directly in the target DBMS. It describes all the repository tables but might be hard to read.

Data Quality profiling5

Now that you know more about how to go further with the repository data model, we can move to the data quality profiling datamart build.

1 – Extract data from DataFlux repository and build a data quality profiling datamart

This phase is done using a DataFlux Process Job that combines multiples DataFlux Data Jobs. Here are the steps:

  • Get common information and field metrics from a join between the following repository tables:
    • DSN (Records each ODBC DSN in the job),
    • FLD (Records metrics for each field in a table),
    • TBL (Records each table that is processed),
    • JOB (Records each run of a profile job),
    • REPORT (Records Profile reports),
    • JSETUP (Job setup root table)
  • Get frequency distributions from a dynamic select of FRQDST<xxx> (Frequency distribution tables; 1 table per field per execution)
  • Get pattern distributions from PATDST (Pattern frequency distribution)
  • Get outliers from OUTLR (Min/Max Outliers)
  • Get percentiles from PTILE (Percentiles)

Data Quality profiling6

NB: The other steps (Table/Field notes, Alerts, Business Rules, Custom Metrics) are not implemented right now in my data quality profiling jobs and VA reports but might be used in the future.

As a result, we’ve got 5 tables that shape our data quality Profiling datamart:

  • PFO_FLD: master table containing summary results for all fields for all executions
  • PFO_FRQDST: frequency distributions for all fields for all executions
  • PFO_OUTLR: outliers for all fields for all executions
  • PFO_PATDST: pattern frequency distributions for all fields for all executions
  • PFO_PTILE: percentiles for all fields for all executions

NB: It was not possible to model only 1 table for all needs because of multiple 1-N relations. This is not a problem because VA is able to build single reports based on multiple data sources.

The first 2 nodes of this process job prepare the environment:

  • Define Variables Here: Initialization of macro variables (where to store the datamart; number of values to keep for frequency distributions and pattern frequency distributions)
  • FILE Storage Management: adapt the DSN properties even if the repository is a database file repository ( a file with an .rps extension)

This Process Job is “self-repository” based. That is to say, it extracts data quality profiling results from the repository where this Process Job is running. You just have to put this job in whatever repository you want to extract results from. Database file repositories are supported.

This Process Job works only in REPLACE mode. The data quality profiling datamart is overwritten each time it runs. No append or delta mode. All the data quality profiling results need to be kept in the original DataFlux repository.

Next time I’ll talk about the second part: load the data quality profiling datamart into memory and use SAS Visual Analytics to share data quality profiling results wisely.

Concluding note:

Querying the profile repository, while useful, must be used with the understanding that the profile repository schema does change (though it’s rare).

Post a Comment

New year resolution: Don't let old stuff slow down your SAS Grid

keep your SAS Grid from slowing downAs another year goes by, many people think about new year’s resolutions. It’s probably the third year in row that I’ve promised myself I’d start exercising in the fabulous SAS gym. Of course, I blame the recently concluded holiday season. With all the food-focused events, I couldn’t resist, ate way too much and now feel like I can barely move. The same can happen to a SAS Grid environment, as I learned by helping some colleagues that were debugging a SAS Grid that “ate” too many jobs and often refused to move. In this blog I’ll share that experience in the hopes that you can learn how to keep your SAS Grid from slowing down.

The issue

The symptoms our customer encountered were random “freezes” of their entire environment without any predictable pattern, even with no or few jobs running on the SAS Grid, LSF daemons stopped responding for minutes. When it happened, not only did new jobs not start, but it was also impossible to query the environment with simple commands:

$ bhosts

LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...

Then, as unpredictably as the problem started, it also self-resolved and everything went back to normal… until the next time.

Eventually we were able to find the culprit. It all depends on the way mbatchd, the Master Batch Daemon, manages its internal events file: lsb.events.

Let’s see what this file is, and why it can cause troubles.

Read More »

Post a Comment

Accessing the provided Relationship Report in SAS Visual Analytics 7.3

In SAS Visual Analytics 7.3, the usage reports accessed from the administrator includes a relationship report.  This report provides very basic lineage and dependency information for selected object types.  The data to populate this report comes from the middle-tier relationship service.  Once you enable the collection and extraction of this data, the data gets autoloaded to the default LASR Analytics Server by the same autoload process that loads the audit data.

If you are not already familiar with the autoload process for the audit report, you can also view my YouTube video here on the same subject.

When you access the Usage Reports from the administrator, you will see that that a Relationship report now appears in the Usage folder.

Relationship Report in SAS Visual Analytics

To enable the Relationship report for viewing, in addition to having the autoload process scheduled and running, in SAS Management console, you must also access Application Management–>Configuration Manager–>SAS Application Infrastructure–>Web Infra Platform Services 9.4–>RelationshipContentService properties and make sure that Scheduling for Load Task Enabled is set to true on the Settings tab.  This will enable the scheduled collection of relationship data. To enable the periodic extraction of the data, in SAS Management console, access Application Management–>Configuration Manager–>SAS Application Infrastructure–>Visual Analytics 7.3 and set the va.extractRelationshipData Advanced property to true.

To put these property settings into effect, you must restart the SAS Web Application Server.

The first scheduled run of the data load should occur shortly after the server restarts.  The task runs every three hours.  After the load task runs, you will see  both relationships_visualanalytics.csv and relationships_visualanalytics.sas7bdat in the EVDMLA directory under config/Lev1/AppData/SASVisualAnalytics/VisualAnalyticsAdministrator/AutoLoad.

When the Visual Analytics autoload process has run, you should see the RELATIONSHIPS_VISUALANALYTICS table loaded to the LASR Analytic Server:

Relationship_Report_in_SAS_Visual_Analytics2

The first report that displays is the list of Reports and Explorations, but you can click the icon to select Tables, Most References Names, or Least Referenced Objects.

Relationship_Report_in_SAS_Visual_Analytics3

Clicking on a report object displays the related objects:

Relationship_Report_in_SAS_Visual_Analytics4

You can access the Tables report and click on a table to display all objects that have a dependency on the selected table.

Relationship_Report_in_SAS_Visual_Analytics5

Accessing the Most Referenced Names report and selecting the Report button displays a word cloud, enabling you to click on a report name to display the dependencies for the report name.  The 20 most referenced objects of each type is displayed.

Relationship_Report_in_SAS_Visual_Analytics6

Accessing the Least Referenced Objects report displays the least referenced objects, along with the percent associated with their access frequency.

As I said earlier, these supplied reports provide very basic lineage information.  See the blog here written by my colleague, Gerry Nelson, to see how to use the batch relationship reporting tools for more detailed relationship reporting.

Post a Comment

Super Users in the SAS Support Communities

There are SAS users, and then there are SAS Super Users in the SAS Support Communities.

Super Users in the CommunitiesWhat makes a communities user "Super"? It's the Super Users that fly around the discussion boards -- answering questions and guiding other community members. They aren't just SAS experts; they are teachers as well, helping new users to "learn to fish" and find their own answers.

The SAS Support Communities Super User program recognizes the strongest members for their outstanding contributions. In addition to the public honor of the "Super User" designation, inductees are given tools that help them to expand their community leadership. We hope this helps them get even more out of their community experience. Certainly, their actions help the rest of us in the community as well.

We asked our Super Users what they enjoy most about the communities. Here's some of what they said:

  • "There is a wealth of expertise available for free."
  • "You can participate any time of day or night."
  • "I love when you get an interesting question and multiple people post different answers trying to one-up each other."
  • "It's an opportunity to share. Most of the answers I provide take me very little time to prepare, and I know programmers can struggle as they are learning."

Congratulations to our inaugural group! These 6 outstanding community members will enjoy the official Super User designation for the next 6 months. They are:

Sign in today to SAS Support Communities and see these SAS Super Users at work. And maybe one day, you can become a Super User too!

Post a Comment

Introducing data-driven loops

Programming loops are a staple of any computer programming language. In essence, loops are what make programming languages so powerful.

In all programming loops, a group of statements is executed repeatedly based on the value of an index variable that changes from one iteration to another. The iterations are continued until a certain condition is met. However, in ordinary loops that index variable’s values are at least partially embedded in the programming code.

For example in the SAS’ iterative do statement embedded (hard coded) in the code are initial value 1, upper boundary 21 and increment 3:

do i=1 to 21 by 3;

Still, it is very convenient in case of linear pattern of incrementing or decrementing the index, and convenience trumps anathema of the programming – hardcoding.

If we don’t have a pattern of index variable alteration, SAS graciously provides a list of index variable values without any pattern in its iterative do-statement. For example:

do i=1, 5, 10, 8, 7;

do c=’Volvo’, ’GM’, ’Honda’, ’Ford’, ’Toyota’;

Even though the index variables can be of any data type (numeric or character), embedding their values into the do statement makes a program hard-coded (which is a curse word in programming). That requires change of the code every time when the index variable values or their order change.

SAS goes even further, allowing us to use variable names in the list of index values:

do c=var1, var2, var3, var4;

But if your data step is processing a long table, adding new variables to it with duplicate values across its observations is not the most efficient way of doing it.

While loops in general and SAS do-loops in particular belong to the code class, index variable values are data and as such ideally should not belong to the code.

Wouldn’t it be nice to have those index variable values removed from the code altogether and have them pulled from some external data source? By answering “yes” to this question we arrive at the notion of a data-driven loop.

Read More »

Post a Comment

SAS Global Forum Executive Board: Shaping the conference experience for SAS users

SAS Global ForumSAS Global Forum is the premier worldwide event for SAS professionals. It brings nearly 5,000 users and executives from across the world together to share new techniques and technologies to help expand users’ SAS skills. But an event of this size (hundreds of presentations, dozens of workshops and several networking opportunities) doesn’t just happen.  While SAS Global Forum is supported by SAS, the event is entirely planned and run by volunteers of the SAS Global Users Group and governed by an Executive Board charged with shaping the conference experience for fellow SAS users.

A Little History

Volunteers have been running the event since, well, before it was an event. It can be argued that the first event of SAS users occurred in 1974, when a two-hour “birds-of-a-feather” session of SAS software users was held at the annual IBM Share meeting. The following summer, SAS users met again for two days at Abbott Labs. These two gatherings were so successful that the attendees decided to plan an annual, international meeting of SAS users and the first annual International SAS Users Conference was held in Kissimmee, Florida, in January 1976.

After this initial experience, this group of volunteers realized that they simply didn’t have the resources to completely execute a large conference. They asked SAS to help with mailings and print collateral and formed the original SAS Global Forum Executive Board to handle planning and execution. The board would select a chair from its members. Eventually, every member of the original board served as chair. To expand the pool of possible chairs, the board began selecting new chairs from users who were actively using SAS and regularly attending the annual conference. These new chairs would serve on the Executive Board for terms before their respective conference years.

Today the Executive Board consists of 30 members and five SAS employees. These SAS experts have decades of technical and SAS expertise and are recognized as leaders in their chosen fields. Most importantly, though, they are fiercely dedicated to SAS and the user community.

The Role of the Executive Board

The goals of the Executive Board haven’t changed from those early days. They are charged with providing the SAS user community opportunities to:

  • Enhance skills with SAS content and professional development.
  • Increase understanding of all products and services provided by SAS.
  • Gain access to a network of peers for collaboration and information exchange.
  • Provide opportunities for feedback to SAS about their products, services, and policies.

SASGlobalForumExecutiveBoardIt’s a role the Executive Board takes very seriously. “My favorite part of being an Executive Board member is having the chance to make a difference for our attendees,” says Lori Griffin, the 2009 conference chair.   “Listening to what users are looking for in their conference experience and working to make that experience the best it can be is very rewarding.”

Executive Board Member and 2003 chair Art Carpenter agrees. “Being a member of the Executive Board provides me an opportunity to influence the direction and content of the SAS Global Forum Conference.”

Read More »

Post a Comment

Nominations for Chair of SAS Global Forum 2019 being accepted through Feb. 8

SAS Global ForumSAS Global Forum is an all-volunteer event, with SAS users from around the globe serving in a variety of roles to make the event happen. Still, no position is more impactful to the success of the conference than the SAS Global Forum Conference Chair. Its importance is highlighted by the fact that the chair position is filled approximately three years in advance of the event. So if you do the math, and even I can do this math, it’s time to receive nominations and ultimately select the chair for SAS Global Forum 2019. Nominations are currently being accepted and will continue through Feb. 8.

This much lead time might seem like overkill, but the chair utilizes every bit of it. (BTW, the conference chair also has duties in the years following the conference, so it really is a multi-year appointment!) At three years out, the chair is shadowing other conference chairs and reviewing surveys and data to determine what’s working or might need more focus. At two years out, the chair develops and refines their conference plans, themes, goals, etc. The year of the conference is spent working with the Global Forum Conference Team on content planning, attracting speakers, encouraging submissions and executing the event. Through all this lead up, the chair is participating on the SAS Global Users Group Executive Board and attending conference planning meetings. After the conference and in future years the chair continues to be an active member of the Executive Board.

Clearly, it’s a special SAS professional who can successfully serve as conference chair and many, many hours go into vetting candidates, interviewing them and deciding on the best candidate to fill this role.

Chair of SAS Global Forum - Debbie BuckFormer chair and current Executive Board member Debbie Buck is serving as Chair of the Nominations Committee for the 2019 search. Debbie is currently a Senior Statistical Programmer/Analyst with Chiltern and was an independent statistical consultant and an instructor for SAS for many years. She was also previously with Theorem Clinical, the Washington University School of Medicine, and Louisiana State University. Over the past 30 years, she’s served in a number of volunteer roles for the SAS Users Groups at the local, in-house, regional and international levels. She served as chair for a number of events, including SCSUG 1997, SSU 2001 (the Joint Conference of the SouthEast and South Central SAS Users Groups) and SAS Global Forum 2011, held at Caesars Palace in Las Vegas, NV.

Recently, I asked Debbie to share a little more information about the nomination process.

Read More »

Post a Comment

What is going on with my Visual Analytics audit data archiving?

In my last blog (What is going on with my Visual Analytics audit data collection) I reviewed how middle-tier auditing works and where to look for logs to debug issues. In this follow up blog I will do the same for the Visual Analytics audit data archiving process. The archive process is important, to quote the Visual Analytics Administration guide “CAUTION! Audit data can consume significant amounts of disk space and processing capacity.”

By default middle-tier auditing includes logon and logoff records. A SAS Deployment will always record logon and logoff activity in the audit tables. When Visual Analytics auditing is enabled a lot of additional activity is recorded, and the size of the audit tables in the Shared Services database can grow quickly. The first step in managing the space that is consumed is to archive data from the audit tables. The archive process is documented in the SAS Middle Tier Administration Guide.

To determine what records to archive an archive job reads the archive rules in the SAS_AUDIT_ARCHIVE_RULE table. The archive job always starts when SAS Web Infrastructure Platform Services starts.  In addition, the default archive job is scheduled to start every Monday at the start of the day, but the archive job schedule can be configured. To change the frequency of the archive see the SAS Middle Tier Administration Guide .

Read More »

Post a Comment

Junior Professional Program brings the SAS Global Forum experience to new users

SmallGraphicThis April, thousands of SAS users will converge on Las Vegas for the premier event for SAS professionals: SAS Global Forum. The event will attract users from across the globe; in every industry imaginable; and from countless government and academic institutions. It’s an excellent place to network with and learn from users of all skill levels, with professionals with 40 years of SAS experience to those just getting started in attendance. To help those relatively new to SAS experience the conference for the first time, the SAS Global Users Group Executive Board put together the Junior Professional Award program.

The program is designed exclusively for professionals who have used SAS on the job for three years or less, have never attended SAS Global Forum, and whose circumstances would otherwise keep them from attending. And, don’t let the word “junior” confuse you. The award is available for full-time SAS professionals, regardless of age. (In fact, it’s perfect for individuals making a mid-life career change or those who recently started using SAS in their job.) You also don’t have to be a SAS programmer. If you’re using SAS as a researcher, systems administrator, analyst or really any line of work, you would qualify.

The Junior Professional award provides user with free conference registration, free training, and a number of opportunities to learn from and network in the larger SAS community. While the program doesn’t cover all costs (travel and lodging are not included), it does remove some of the financial barriers newer SAS users may have in getting approval to attend.

To apply, users simply need to fill out the online application form. Award applications must be received by January 31, 2016. Questions can be directed to the Junior Professional Program Coordinator.

VirkudATo learn more about the award and its benefits, I recently sat down with one of last year’s winners, Arti Virkud.

Larry LaRusso: Hello Arti. First of all, congratulations on winning one of last year’s Junior Professional awards. Tell me how you first learned about the program?

Arti Virkud: I learned about the award from my manager and a colleague at the New York City Department of Health, both of whom were previous award recipients. We have tried really hard to spread the word about what an incredible opportunity this is for SAS professionals on our team.

Read More »

Post a Comment

New year refresher on reshaping data

ProblemSolversAs support analysts in the SAS Technical Support division, we answer many phone calls from SAS customers.  As members of the SAS Foundation team, we get questions that vary significantly in content from all of the areas that we support.  We offer coding tips and suggestions as well as point customers to SAS Notes and documentation. A common question that we are frequently asked is how to collapse a data set with many observations for the same BY variable value into a data set that has one observation per BY variable value.  Another way to phrase this question is: how do you reshape the data from “long to wide?”

Resources to Reshape Data

The following SAS Note illustrates how to use two TRANSPOSE procedure steps to collapse the observations for multiple BY variables:

Sample 44637: Double PROC TRANSPOSE method for reshaping your data set with multiple BY variables

If you prefer to use the DATA step rather than PROC TRANSPOSE, the following SAS Note provides a code sample that accomplishes the same purpose:

Sample 24579: Collapsing observations within a BY-Group into a single observation (when data set has 3 or more variables)

If the data set is “wide” and you’d like to expand each observation into multiple observations to make it a “long” data set, the following SAS Note is helpful:

Sample 24794: Expand single observations into multiple observations

Brief Overview of Some Support.sas.com Resources

Since we’ve been discussing SAS Notes from the support.sas.com site, here is a brief overview of how to use the site to find other helpful items for your future coding projects.

Read More »

Post a Comment