Fun with SAS Text Analytics: A qualitative analysis of IALP papers

Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all of the accepted papers in PDF format, so when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all the paper titles and my conference notes, choose the most interesting articles, and dive into them for details. I then summarize the important research discoveries into one document. This always takes me several days or more to complete.

This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.

My first experiment was to generate a word cloud of all papers. I used these three steps.

Step 1: Convert PDF collections into text files.

With the SAS TGFilter procedure and SAS Document Conversion Server, you can convert a collection of PDFs into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. However, pdftotext converts PDF files into plain text only, so you need to write SAS code to import all of the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly, and checking the texts one by one is tedious. The TGFilter procedure, by contrast, has language-detection functionality, and the detected language of any garbage document is empty rather than ‘English,’ so I recommend TGFilter: you can then filter the garbage documents out easily with a WHERE statement on language not equal to ‘English.’
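
For example, once the converted documents are in a SAS dataset, the garbage documents can be dropped with a single WHERE statement. This is a minimal sketch; it assumes the TGFilter output dataset is WORK.PAPERS and that the detected language is stored in a column named LANGUAGE, as described above.

/* keep only documents whose detected language is English;       */
/* documents that failed conversion come back with a blank value */
data work.papers_clean;
   set work.papers;              /* hypothetical TGFilter output dataset */
   where language = 'English';
run;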

Step 2: Parse documents into words and get word frequencies.

Run the SAS procedure HPTMINE or TGPARSE against the document dataset with the stemming option turned on and the English stop-word list released by SAS; this gives you the frequencies of all stems.
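
A minimal sketch of what this step might look like with PROC HPTMINE, assuming the Step 1 output is WORK.PAPERS_CLEAN with a document ID column DOCID and a text column TEXT (those names, and the exact option set, are assumptions, so check the HPTMINE syntax for your release):

proc hptmine data=work.papers_clean;
   doc_id docid;                    /* hypothetical document ID column       */
   var text;                        /* hypothetical text column              */
   parse stop=sashelp.engstop       /* English stop-word list shipped by SAS */
         outterms=work.terms;       /* one row per term/stem with frequency  */
run;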

Step 3: Generate word cloud plot.

Once you have the term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS IML procedure to submit R scripts from SAS.
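
Here is a rough sketch of that IML-to-R handoff. It assumes the RLANG system option is enabled, the R wordcloud package is installed, and the term table from Step 2 is WORK.TERMS with columns TERM and FREQ (the column names are assumptions):

proc iml;
   call ExportDataSetToR("work.terms", "terms");   /* send the term table to R */
   submit / R;
      library(wordcloud)
      # plot the 500 most frequent stems; column names are assumptions
      wordcloud(terms$Term, terms$Freq, max.words = 500)
   endsubmit;
quit;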

These steps generated a word cloud of the top 500 words from 66 papers. There were 87 papers in total; 21 of them could not be converted correctly by SAS Document Conversion Server, and 19 could not be converted correctly by pdftotext.

Figure 1: Word Cloud of Top 500 Words of 66 Papers

Read More »

Making data personal: Big data made small

Editor's note: Amanda Farnsworth is Head of Visual Journalism at BBC News and a featured speaker at SAS Global Forum 2017, April 2-5, 2017 in Orlando.

My days are spent trying to put the best content we can in front of our loyal, heartland audience, while reaching out to others, particularly on social media, who may not usually come to the BBC for their news.

It can sometimes be hard to reach both audiences at the same time.

But recently we hit on a format that does exactly that. We call it The Personal Relevance Calculator. We have made a whole series of these calculators on different topics, including “The Great British Class Calculator” (yes, we Brits are still obsessed with class!) and “Will a Robot Take Your Job?”

The idea is to take a big data set that tells a story and make it personally relevant to each and every user. Readers simply enter a small amount of personal information – it could be their age or height and weight, or a postcode of where they live – and the result they get back from the calculator is unique, or appears to be unique, to them.  This result is given in a rich, visual way and is very shareable on social media.

The advantage is a much deeper engagement with the subject than we might get by writing a traditional article, and these calculators are usually very popular, getting millions of hits, likes and shares. They also appeal to the parts of the audience that other BBC content doesn’t reach.

Case Study - Who Is Your Olympic Body Match?

You can find the Olympic Body Match calculator using this link:

At the BBC, we know that the Olympics provide us an opportunity to reach a part of the audience that doesn’t often think of us.  Let’s call them Main Eventers – they are people who don’t like to be left out of those water cooler conversations when a big national or international sporting event is going on.  So they want some way of engaging with a story that they often don’t know much about. Perhaps they are not big sports fans.

Read More »

SAS Studio Webwork library demystified

In a previous blog about SAS Studio, I briefly introduced the concept of using the Webwork library instead of the default Work library. I also suggested, in my SAS Global Forum 2016 paper, Deep Dive with SAS Studio into SAS Grid Manager 9.4, saving intermediate results in the Webwork library, because this special library is automatically assigned at start-up and is shared across all workspace server sessions. In the past few days, I have received some requests to expand on the properties of this library and how it is shared across different sessions. What better way to share this information than to write it up in a blog?

As always, I’d like to start with a reference to the official documentation. The SAS® Studio 3.5: User’s Guide describes the Webwork library, along with its differences from the Work library, in the section about interactive mode. The main points are listed below, and a short code sketch after the list illustrates the sharing behavior:

  • Webwork is the default output library in interactive mode. If you refer to a table without specifying both the libref and the table name, SAS Studio assumes it is stored in the Webwork library.
  • The Webwork library is shared between interactive mode and non-interactive mode. Any data that you create in the Webwork library in one mode can be accessed in the other mode.
  • The Work library is not shared between interactive mode and non-interactive mode. Each workspace server session has its own separate Work library, and data cannot be shared between them.
  • Any data that you save to the Work library in interactive mode cannot be accessed from the Work library in non-interactive mode. Also, you cannot view data in the Work library from the Libraries section of the navigation pane if the data was created in interactive mode.
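
To make the Webwork behavior concrete, here is a minimal sketch (the table name is illustrative): a table written to WEBWORK in one mode can be read back, under the same two-level name, from the other mode.

/* run in one mode (interactive or non-interactive): write to WEBWORK */
data webwork.class_subset;
   set sashelp.class;
   where age > 13;
run;

/* later, from the other mode, the same table is still visible */
proc print data=webwork.class_subset;
run;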

Beyond what the documentation says, we can list some additional considerations:

Read More »

Enhancing SAS Asset Performance Analytics’ Root Cause Analysis with Calculated Columns

During a recent customer visit, I was asked how to include a calculated variable within SAS Asset Performance Analytics’ (APA) Root Cause Analysis workflow. This is a simple request, but is there an equally simple way to do it?

As a reminder, in the APA workflow an ETL Administrator makes a Data Mart available in the solution for the APA users, who can then select variables and explore, analyze and build models based on the columns present in the Data Mart.

But what if you want to analyze a calculated column, such as the difference between two variables? Do you need to change the Data Mart? Yes, if you want the APA Data Selection to include the calculated column. But that takes time, and do you really need to change the Data Mart? No!

A simpler and faster approach to adding a calculated column is to modify the APA Root Cause Analysis workflow itself. And this is simple!

SAS Asset Performance Analytics is highly configurable: you can customize its analytical workflows by modifying their underlying stored processes. Let me show you how to customize an existing analysis and add a calculation step that enhances APA’s Root Cause Analysis with a calculated column.
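
To give a sense of what such a calculation step involves, the code the stored process needs to run can be as small as a single DATA step. This is only a sketch; the dataset and variable names below are hypothetical placeholders, not APA's actual table names.

/* hypothetical calculation step: derive a new tag as the difference */
/* between two existing sensor variables in the analysis input table */
data work.rca_input;
   set work.rca_input;
   temp_delta = temp_outlet - temp_inlet;   /* the calculated column */
run;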

Benefits

The main purpose of this customized analysis is to avoid having to use SAS Enterprise Guide. APA users are rarely SAS experts, so asking them to switch between tools depending on which functionality is available in the APA GUI isn’t recommended. The more you can do within the APA interface via wizard-guided workflows, the easier it will be.

The second benefit is keeping the Data Mart limited to crucial variables. Instead of asking an ETL Administrator to add non-validated and/or infrequently used calculated columns to the Data Mart, let the APA user test and create meaningful tags to enhance workflows as needed. Once the APA user identifies new and meaningful calculated variables, they can easily be added to the Data Mart and made available to APA Explorations and APA Stability Monitoring. Limiting the Data Mart to critical variables keeps the data size optimized and adjusted only as needed.

Read More »

SAS integration with Hadoop - one success story

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage: it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I want to share how SAS worked with a government revenue collection agency to successfully manage its big data issues and seamlessly integrate with Hadoop and other technologies.

Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invested some resources in learning about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part, called MapReduce. Hadoop splits large files into blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed at Yahoo to solve web-scale problems like webpage search and indexing. However, the platform’s potential to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement, until the platform started being used by enterprises across the world.

Security incidents and massive fines have become commonplace, and financial institutions, in particular, are doing everything they can to avoid them. Security should never be an afterthought; it should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:

Read More »

Securing sensitive data using SAS Federation Server at the data source level

Data virtualization is an agile way to provide virtual views of data from multiple sources without moving the data. Think of data virtualization as another arrow in your quiver for combining data from different sources to augment your existing Extract, Transform and Load (ETL) batch processes. SAS® Federation Server is a unique data virtualization offering that provides not only blending of data, but also on-demand data masking, encryption and cleansing of the data. It provides a central, virtual environment for administering and securing access to your Personally Identifiable Information (PII) and other data.

Data privacy is a major concern for organizations, and one of the features of SAS Federation Server is that it allows you to control access to your data effectively and efficiently, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers and names. In this three-part blog series, I will explore controlling data access using SAS Federation Server. The series will cover the following topics:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server, and SAS Federation Server Manager is used to help control access to the data. In this blog, I want to explore controlling access to specific sources of data using SAS Federation Server. Obviously, you can secure data at its source by using secured metadata-bound libraries in SAS Metadata Server or by using a database’s or file’s own security mechanisms. However, SAS Federation Server can also control access to these data sources by authenticating the users and groups defined in SAS Management Console and setting authorizations within SAS Federation Server Manager.

Read More »

SAS Global Forum 2017 is closer to home, or should I say…

…est plus près de la maison, está más cerca de casa, está mais perto de casa, dichter bij huis, is closer to home, eh!

In analytics and statistics, we often talk about sample sizes. The size of the data sets that you analyze is a measure of the amount of information contained within those data. When observations are very similar or correlated due to study design, the information added by having multiple (correlated) observations may be negligible. This is a common problem with clustered data; the information contained in clustered data is closer to the number of clusters than to the number of observations. As a result, study designers seek to measure many clusters.

When it comes to global presenters, SAS Global Forum is seeking more clusters.

Global representation at SAS Global Forum enriches the conference experience for all attendees, providing each of us with more innovation and information to advance the goals of our organizations.

However, we know that attending our conference from the far corners of the globe is expensive … but not as expensive as it used to be! We’ve got good news for SAS users who reside outside the contiguous 48 states of the United States (residents of Alaska, Hawaii, and U.S. territories, read this carefully!).

To ease the financial burden of travelling from afar to the conference, two new policies have been adopted by the SAS Global Users Group – largely in response to your concerns about cost.

Doubled discount for accepted contributed sessions

Each year, SAS Global Forum attracts about 700 proposed sessions from the user community. The review process is competitive as we can only accept 400 session talks. To attract even more submissions from around the globe, we’ve raised the registration discount from 25% to 50% for accepted proposals from the international user community. If you reside outside the 48 contiguous States, and your abstract is approved, you will automatically receive the 50% discount when you register.

As of the writing of this blog, SAS Global Forum 2017 will include four sessions from Africa, nine from Australia, 18 from Asia, 12 from South America and the Caribbean, 37 from Canada, 21 from Europe, and 23 from the United Kingdom. With this new policy, we expect far more in 2018 and beyond!

Read More »

Modifying variable attributes in all datasets of a SAS library

Using the DATASETS procedure, we can easily modify SAS variable attributes such as name, format, informat and label:

proc datasets library=libref;
  modify table_name;
    format var_name date9.;          /* change the display format */
    informat var_name mmddyy10.;     /* change the informat       */
    label var_name = 'New label';    /* change the label          */
    rename var_name = var_new_name;  /* change the name           */
run;
quit;

We cannot, however, modify fixed variable attributes such as variable type and length.

Notice that we apply variable attribute modifications to one table at a time; the table name is specified in the MODIFY statement of PROC DATASETS. There can be multiple MODIFY statements within a single PROC DATASETS step to apply modifications to multiple data tables.

But what if we need to change variable attributes in all tables of a data library where that variable is present? Say, there is a data model change that needs to be applied consistently to all the data tables in that model.

Business case

Let’s consider a business case for a SAS programmer working for a pharmaceutical company or a contract research organization (CRO) managing data for a clinical trials project. Imagine that you need to apply some variable attributes modifications to all SDTM (Study Data Tabulation Model) and/or ADaM (Analysis Data Model) datasets to bring them in compliance with the CDISC (Clinical Data Interchange Standards Consortium) requirements. Obviously, “manually” going through dozens or hundreds of data tables in different domains to identify whether or not they contain your variable of interest is not an efficient solution.
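
One way to attack this (a sketch only, and not necessarily the approach developed in the full post) is to query DICTIONARY.COLUMNS for every table in the library that contains the variable, then generate a MODIFY block for each of those tables with CALL EXECUTE. The libref MYLIB, the variable VAR_NAME and the new label are placeholders.

/* 1) find every table in the library that contains the variable */
proc sql noprint;
   select memname into :tables separated by ' '
   from dictionary.columns
   where libname = 'MYLIB' and upcase(name) = 'VAR_NAME';
quit;

/* 2) generate one MODIFY block per table */
data _null_;
   call execute('proc datasets library=mylib nolist;');
   do i = 1 to countw("&tables");
      call execute('modify ' || strip(scan("&tables", i)) || ';');
      call execute("label var_name = 'New label'; run;");
   end;
   call execute('quit;');
run;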

Read More »

Picture-in-Picture - It’s Not Just for Television Anymore

With fall comes cooler weather and, of course, football. Lots of football. Oftentimes there will be two NFL games on that my husband wants to watch at the same time. Instead of flipping back and forth between two television stations, he can watch both games simultaneously, thanks to the picture-in-picture feature that we have on our television. This same concept works for SAS® ODS Graphics.

Have you ever been viewing two graphs across pages, flipping back and forth between the two and wishing you could see them together? Now you can. The Graph Template Language (GTL) and PROC SGRENDER enable you to produce a graph inside of a graph, similar to the picture-in-picture feature on your television.

The Game Plan

In this example, we are going to create a graph in the upper right corner of the axis area of a larger graph. When we define the GTL, we always start with the same wrapper, shown below, in which INSET is the name of the GTL definition:

proc template;
  define statgraph inset;
    begingraph;

      /* insert the code that produces the graphics output */

    endgraph;
  end;
run;

For demonstration purposes, we are going to use the SAS data set Sashelp.Heart and we are going to plot the variable CHOLESTEROL. The ENTRYTITLE statement defines the title for the graph. This statement is valid within the BEGINGRAPH block or after the last ENDLAYOUT statement. The plotting statements are contained within a LAYOUT block. In our example, we have enclosed the HISTOGRAM and DENSITYPLOT plotting statements inside a LAYOUT OVERLAY block. A standard axis is displayed with the BINAXIS=FALSE option in the HISTOGRAM statement. In the PROC SGRENDER statement, we point to the template definition, INSET, using the TEMPLATE option.
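
Putting those pieces together, the inset definition might look something like the following sketch (the title text is illustrative, and the full post goes on to place this graph inside a larger one):

proc template;
  define statgraph inset;
    begingraph;
      entrytitle 'Cholesterol Distribution';      /* illustrative title */
      layout overlay;
        histogram cholesterol / binaxis=false;    /* standard axis      */
        densityplot cholesterol;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=sashelp.heart template=inset;
run;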

Read More »

Use images in SAS Visual Analytics to enhance your report link

If your SAS Visual Analytics report requirements include linking out to separate reports without the need to pass values, you may want to consider using images to enhance the appearance of your base report. Here are three image styles to choose from, depending on your report design requirements and your report users’ preferences:

1. Visually appealing
2. Generic
3. Screenshot of actual report

There is no substitute for looking at examples, so here are some screenshots for you:

1. Visually appealing

Read More »
