SAS integration with Hadoop - one success story

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage; it helps them ensure public confidence in their work or meet standards mandated by law. In this post, I want to share how SAS worked with a government revenue collection agency to successfully manage its big data issues and integrate seamlessly with Hadoop and other technologies.

Hadoop Security

Most of us know Hadoop pretty well by now. If you haven’t heard of it yet, it is about time you invested some resources in learning more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part, called MapReduce. Hadoop splits files into large blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was not considered a major requirement until enterprises across the world started using it.

Security incidents and massive fines have become commonplace, and financial institutions in particular are doing everything they can to avoid them. Security should never be an afterthought; it should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:


Securing sensitive data using SAS Federation Server at the data source level

Data virtualization is an agile way to provide virtual views of data from multiple sources without moving the data. Think of data virtualization as another arrow in your quiver: a way to combine data from different sources that augments your existing Extract, Transform and Load (ETL) batch processes. SAS® Federation Server is a unique data virtualization offering that provides not only blending of data, but also on-demand data masking, encryption and cleansing of the data. It provides a central, virtual environment for administering and securing access to your Personally Identifiable Information (PII) and other data.

Data privacy is a major concern for organizations. SAS Federation Server lets you control access to your data effectively and efficiently, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers and names. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series will cover the following topics:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server, and SAS Federation Server Manager is used to help control access to the data. In this post, I want to explore controlling access to specific sources of data using SAS Federation Server. You can, of course, secure data at its source by using secured metadata-bound libraries in SAS Metadata Server or by using a database’s or file’s own security mechanisms. However, SAS Federation Server can also control access to these data sources by authenticating against the users and groups in SAS Management Console and setting authorizations within SAS Federation Server Manager.


SAS Global Forum 2017 is closer to home, or should I say…

est plus près de la maison, está más cerca de casa, está mais perto de casa, dichter bij huis, is closer to home, eh!

In analytics and statistics, we often talk about sample sizes. The size of the data sets that you analyze is a measure of the amount of information contained within those data. When observations are very similar or correlated due to study design, the information added by having multiple (correlated) observations may be negligible. This is a common problem with clustered data: the amount of information contained in clustered data is closer to the number of clusters than to the number of observations. As a result, study designers seek to measure many clusters.

When it comes to global presenters, SAS Global Forum is seeking more clusters.

Global representation at SAS Global Forum enriches the conference experience for all attendees, providing each of us with more innovation and information to advance the goals of our organizations.

However, we know that attending our conference from the far corners of the globe is expensive … but not as expensive as it used to be! We’ve got good news for SAS users who reside outside the contiguous 48 states of the United States (residents of Alaska, Hawaii, and U.S. territories, read this carefully!).

To ease the financial burden of travelling from afar to the conference, two new policies have been adopted by the SAS Global Users Group – largely in response to your concerns about cost.

Doubled discount for accepted contributed sessions

Each year, SAS Global Forum attracts about 700 proposed sessions from the user community. The review process is competitive, as we can accept only 400 session talks. To attract even more submissions from around the globe, we’ve raised the registration discount from 25% to 50% for accepted proposals from the international user community. If you reside outside the 48 contiguous states and your abstract is approved, you will automatically receive the 50% discount when you register.

As of this writing, SAS Global Forum 2017 will include four sessions from Africa, nine from Australia, 18 from Asia, 12 from South America and the Caribbean, 37 from Canada, 21 from Europe, and 23 from the United Kingdom. With this new policy, we expect far more in 2018 and beyond!


Modifying variable attributes in all datasets of a SAS library

Using the DATASETS procedure, we can easily modify SAS variable attributes such as name, format, informat and label:

proc datasets library=libref;
  modify table_name;
    /* FORMAT and INFORMAT take no equal sign; LABEL and RENAME do */
    format var_name date9.;
    informat var_name mmddyy10.;
    label var_name = 'New label';
    rename var_name = var_new_name;
  run;
quit;

We cannot, however, modify fixed variable attributes such as variable type and length.

Notice that we apply variable attribute modifications to one table at a time; the table name is specified in the MODIFY statement of PROC DATASETS. A single PROC DATASETS step can contain multiple MODIFY statements to apply modifications to multiple data tables.

But what if we need to change variable attributes in all tables of a data library where that variable is present? Say, there is a data model change that needs to be applied consistently to all the data tables in that model.

Business case

Let’s consider a business case for a SAS programmer working for a pharmaceutical company or a contract research organization (CRO) managing data for a clinical trials project. Imagine that you need to apply some variable attribute modifications to all SDTM (Study Data Tabulation Model) and/or ADaM (Analysis Data Model) datasets to bring them into compliance with the CDISC (Clinical Data Interchange Standards Consortium) requirements. Obviously, “manually” going through dozens or hundreds of data tables in different domains to identify whether or not they contain your variable of interest is not an efficient solution.
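
One possible way to automate this, offered as a minimal sketch rather than as the approach the full post takes, is to look up the tables that contain the variable in DICTIONARY.COLUMNS and then generate one MODIFY block per table. The library name MYLIB, the variable name VAR_NAME, the label text and the macro name MODIFY_ALL are all hypothetical, and the sketch assumes that table names contain no blanks.

```sas
/* Hypothetical names for illustration: library MYLIB, variable VAR_NAME. */
%let lib = MYLIB;
%let var = VAR_NAME;
%let tables = ;   /* stays empty if no table contains the variable */

/* Collect the names of all data sets in the library that contain the variable. */
proc sql noprint;
  select memname
    into :tables separated by ' '
    from dictionary.columns
    where libname = upcase("&lib")
      and memtype = 'DATA'
      and upcase(name) = upcase("&var");
quit;

/* Generate one MODIFY block per table inside a single PROC DATASETS step. */
%macro modify_all;
  %local i table;
  %if %length(&tables) = 0 %then %do;
    %put NOTE: No table in &lib contains &var..;
    %return;
  %end;
  proc datasets library=&lib nolist;
    %do i = 1 %to %sysfunc(countw(&tables, %str( )));
      %let table = %scan(&tables, &i, %str( ));
      modify &table;
        label &var = 'New label';   /* example modification only */
    %end;
    run;
  quit;
%mend modify_all;

%modify_all
```

The LABEL statement stands in for whatever attribute change is needed; a FORMAT, INFORMAT or RENAME statement could take its place in the same position.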


Picture-in-Picture - It’s Not Just for Television Anymore

With fall comes cooler weather and, of course, football. Lots of football. Oftentimes there will be two NFL games on that my husband wants to watch at the same time. Instead of flipping back and forth between two television stations, he can watch both games simultaneously, thanks to the picture-in-picture feature that we have on our television. This same concept works for SAS® ODS Graphics.

Have you ever been viewing two graphs across pages, flipping back and forth between the two and wishing you could see them together? Now you can. The Graph Template Language (GTL) and PROC SGRENDER enable you to produce a graph inside of a graph, similar to the picture-in-picture feature on your television.

The Game Plan

In this example, we are going to create a graph in the upper right corner of the axis area of a larger graph. When we define the GTL, we always start with the same GTL wrapper, as is shown below. In the wrapper below, INSET is the name of the GTL definition:

proc template;
  define statgraph inset;
    /* insert the code that produces the graphics output */
  end;
run;

For demonstration purposes, we are going to use the SAS data set Sashelp.Heart and we are going to plot the variable CHOLESTEROL. The ENTRYTITLE statement defines the title for the graph. This statement is valid within the BEGINGRAPH block or after the last ENDLAYOUT statement. The plotting statements are contained within a LAYOUT block. In our example, we have enclosed the HISTOGRAM and DENSITYPLOT plotting statements inside a LAYOUT OVERLAY block. A standard axis is displayed with the BINAXIS=FALSE option in the HISTOGRAM statement. In the PROC SGRENDER statement, we point to the template definition, INSET, using the TEMPLATE option.
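
Putting the statements described above together, a minimal sketch of the main graph might look like the following. The title text is an assumption, and the sketch covers only the outer graph (the histogram and density curve), not the inset that the full post goes on to build.

```sas
proc template;
  define statgraph inset;
    begingraph;
      /* Title text is an assumption for illustration */
      entrytitle "Distribution of Cholesterol";
      layout overlay;
        /* BINAXIS=FALSE requests a standard axis instead of a bin axis */
        histogram cholesterol / binaxis=false;
        /* Overlay a density curve (normal by default) on the histogram */
        densityplot cholesterol;
      endlayout;
    endgraph;
  end;
run;

/* Render the template against the sample data set */
proc sgrender data=sashelp.heart template=inset;
run;
```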


Use images in SAS Visual Analytics to enhance your report link

If your SAS Visual Analytics report requirements include linking out to separate reports without the need to pass values, you may want to consider using images to enhance the appearance of your base report. Here are three style examples using images that you can use depending on your report design requirements and report user preference:

1. Visually appealing
2. Generic
3. Screenshot of actual report

There is no substitute for looking at examples, so here are some screenshots for you:

1. Visually appealing


Using suggestion-based matching in SAS Data Quality

We all have challenges in getting an accurate and consistent view of our customers across multiple applications or sources of customer information. Suggestion-based matching is a technique found in SAS Data Quality to improve matching results for data that has arbitrary typos and incorrect spellings in it. The suggestion-based concept and benefits were described in a previous blog post. In this post, I will expand on the topic and show how to build a data job that uses suggestion-based matching in DataFlux Data Management Studio, the key component of SAS Data Quality and other SAS Data Management offerings. This article takes a simple example job to illustrate the steps needed to configure suggestion-based matching for person names.

In DataFlux Data Management Studio, I first configure a Job Specific Data node to define the columns and example records that I’d like to feed into the matching process. In this example, I use a two-column data table made up of a Rec_ID column and a Name column, along with a set of sample records.


To build the suggestion-based matching feature, I have to insert and configure at least a Create Match Codes node, a Clustering node and a Cluster Aggregation node in the data job.


Calling SAS Data Quality jobs from Python

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job has a REST API automatically created once it is moved to the Data Management Server. This is a great feature and enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other Data Quality work that is impossible or challenging to do when using only Python.

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to Data Management Server in your browser, where you’ll find respective links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata


Use Rank in SAS Visual Analytics to display the last date, month or rolling window

Requirements that are the most easily described can often be the most difficult to implement. I’m referring to requests like:

  • Display a gauge with the most recently collected metric.
  • Plot an 18-month rolling window of profit.
  • Display last month’s product percent-of-total metrics for visual comparison.

Okay, so these are pretty specific requests, which I built a report to answer. Nonetheless, requirements like these do exist.


So, how do you implement these requests? Use rank! You might be wondering how this is possible since the rank feature requires a numeric value and these requirements are based on dates. Solution: use the TreatAs function. Let’s break it down step by step.

But first, here is a breakdown of the report objects used in this report. Notice that this report contains a section prompt via a button bar which prompts the user to select a Product Line. This section prompt filters all of the other objects by that Product Line value.


Macro variables that provide information about your SAS® environment

Have you ever needed to run code based on the client application that you are using? Or have you needed to know the version of SAS® software that you are running and the operating system that you are running it on? This blog post describes a few automatic macro variables that can help with gathering this information.

Application Name

You can use the &_CLIENTAPP macro variable to obtain the name of the client application. Here are some details:

  • Referencing &_CLIENTAPP in SAS® Studio returns a value of SAS Studio
  • Referencing &_CLIENTAPP in SAS® Enterprise Guide® returns a value of 'SAS Enterprise Guide'
    Note: The quotation marks around SAS Enterprise Guide are part of the value.

Program Name

You can use the &SYSPROCESSNAME macro variable to obtain the name of the current SAS process. Here are some details:

  • Referencing &SYSPROCESSNAME interactively within the DMS window returns a value of DMS Process
  • Referencing &SYSPROCESSNAME in the SAS windowing environment of your second SAS session returns a value of DMS Process (2)
  • Referencing &SYSPROCESSNAME in SAS Enterprise Guide or SAS Studio returns a value of Object Server
  • Referencing &SYSPROCESSNAME in batch returns the word Program followed by the name of the program being run (for example: Program 'c:\')
    Note: For information about other techniques for retrieving the program name, see SAS Note 24301: “How to retrieve the program name that is currently running in batch mode or interactively.”


The following code illustrates how you can use both of these macro variables to check which client application you are using and display a message in the SAS log based on that result:
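
The original listing is not reproduced in this excerpt. As a minimal sketch of such a check, the macro below tests the value of &_CLIENTAPP and writes a message, along with the value of &SYSPROCESSNAME, to the SAS log. The macro name CHECK_CLIENT, the message wording and the %SYMEXIST guard are assumptions added for illustration. %INDEX is used instead of an exact comparison so that the quotation marks in the SAS Enterprise Guide value do not matter.

```sas
%macro check_client;
  /* _CLIENTAPP is set by client applications; guard against sessions */
  /* (for example, batch) in which it might not be defined.           */
  %if %symexist(_clientapp) %then %do;
    %if %index(%superq(_clientapp),SAS Studio) > 0 %then
      %put NOTE: Running in SAS Studio. Process name: %superq(sysprocessname);
    %else %if %index(%superq(_clientapp),Enterprise Guide) > 0 %then
      %put NOTE: Running in SAS Enterprise Guide. Process name: %superq(sysprocessname);
    %else
      %put NOTE: Client application is %superq(_clientapp). Process name: %superq(sysprocessname);
  %end;
  %else
    %put NOTE: _CLIENTAPP is not defined. Process name: %superq(sysprocessname);
%mend check_client;

%check_client
```

Using %SUPERQ keeps any quotation marks or special characters in the two values from being interpreted by the macro processor.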
