How to create fancy statistical graphs in SAS University Edition

If you want to become a 'data scientist' then you should probably learn SAS/STAT ... and this blog shows you the basics of how to run a statistical analysis in the free SAS University Edition.

In my previous blog posts, you learned how to install SAS University Edition, and how to create some basic graphs in SAS. But in order to become a highly paid data scientist, you need to know how to do more than simply graph the data - you need analytics. And the SAS/STAT product is one of the best tools for performing statistical analyses. In this blog I show you how easy it is to run data through a SAS/STAT procedure, and produce some really impressive graphical visualizations of the results.

First we need some (fake) sample data. In my previous blogs I showed you how to use sample data that was included with SAS. This time I'll show you how to create your own (random) sample data from scratch. In the code below, I loop through and create 1000 lines of data in a data step. Copy-n-paste the following into your CODE window, and run it (click the button with the little icon of a 'running man'):

data fakedata;
 /* z1 is shared by x and y, so the two variables are correlated */
 do i = 1 to 1000;
  z1 = rannor(125);
  z2 = rannor(125);
  z3 = rannor(125);
  x = 3*z1+z2;
  y = 3*z1+z3;
  output;
 end;
run;
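
Before moving on, it can be worth a quick sanity check of the fake data. The sketch below is not part of the original walkthrough; it assumes the fakedata data set from the step above exists. Because x and y both contain the 3*z1 term, and assuming z1, z2, and z3 behave as independent standard normal draws, the theoretical correlation is 9/10 = 0.9, so the sample estimate should land near that value.

```sas
/* Sanity check (illustrative): x and y share the 3*z1 term,   */
/* so their sample correlation should come out close to 0.9.   */
proc corr data=fakedata;
 var x y;
run;
```

If the correlation is close to zero instead, the data step probably did not run as intended.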

Once you have successfully run the code and created the random sample data, you can use Proc KDE to analyze it and generate some impressive graphics (the KDE procedure performs bivariate kernel density estimation).

proc kde data=fakedata;
 bivar x y / plots = (contour contourscatter histogram surface);
run;

And if you've done everything correctly, you'll get the following:





But let me leave you with a stern warning ... Please don't just blindly run the SAS/STAT procedures without understanding what they do. You need to understand the assumptions & requirements for the data, and have a good basic knowledge of what the analysis is doing, for each statistical analysis you perform. Just because a SAS statistical procedure can run against your data without producing any 'ERROR' messages, does not mean that statistical analysis was valid for that particular data.


How to create a bubble plot in SAS University Edition

Are you a fan of Hans Rosling's famous bubble plots? ... Then why not learn how to create your own bubble plots in SAS University Edition?!? :)

Perhaps you saw my SAS/GRAPH imitation of Hans Rosling's animation in a previous blog (see a snapshot of my graph below)? Or perhaps the SGPLOT version in Sanjay Matange's blog? Or maybe you're just a fan of bubble plots in general? Whatever the case, this blog will show you the basics of creating bubble plots in your free copy of SAS University Edition that you recently downloaded!


First, you'll need to have some data that makes sense to visualize with a bubble plot. You'll typically be representing 3 or 4 values with each bubble. Your X and Y variables will be represented by the position of the marker (like a regular scatter plot), and the size of the marker will represent the value of a 3rd variable. And you'll sometimes want to use a 4th variable to control the color of the bubbles.

Perhaps you already have the 'perfect' data for a bubble chart, but you'll often need to summarize your data first. There are several ways to do that in SAS - I'll show you the SQL way, since many of you are probably already familiar with SQL. Enter the following into the CODE tab of the Program 1 window, to summarize the data from the SASHELP.CARS data set (which ships with SAS). You can type the code by hand, or copy-n-paste it. Then click the Run button (icon of a little man running). Look at the log messages to make sure it ran correctly.

proc sql;
create table car_summary as
select unique origin, make,
 avg(horsepower) as hp,
 avg(mpg_city) as city,
 avg(mpg_highway) as highway
from sashelp.cars
where type='Truck'
group by origin, make;
quit; run;
proc print data=car_summary;
run;

If you entered & ran all the code correctly, the Proc Print should produce the following summarized table:

Now enter & run the following code (in the CODE tab again) to create the bubble plot. The X/Y position of the bubbles will be determined by the Highway and City MPG, the size of the bubbles will represent the Horsepower, and the color will represent the country of Origin. If you're typing the code by hand, make sure to include all the quotes, slashes, and semicolons - they are important!

Title "Truck MPG and Horsepower Comparison";
proc sgplot data=car_summary;
bubble x=highway y=city size=hp /
 group=origin datalabel=make;
 keylegend / location=inside position=bottomright;
run;

And if you did everything just right, you should get a bubble plot that looks a lot like this ... and you're well on your way to becoming a SAS visualization expert! :)


Now that you're a bubble plot expert, what data would you like to use in your own bubble plot? (feel free to add your reply/answer in a comment)


SAS Programming is going on tour

One of my favorite bands, Kings of Leon, is touring again this year and making a stop in Raleigh.

I didn’t want to take any chances that I might miss them playing in my hometown so I bought tickets as soon as they went on sale.

As you might agree, music is best heard live, but sometimes your only chance to experience that is when the band goes on tour.

The same can be said for training. So that’s why we’re taking three of our most popular courses, SAS Programming 1, 2 and 3 on the road for a five-city tour.

Here are the upcoming cities and dates.

  • Richmond: Programming 1 - Sept. 3-5 and Programming 2 - Oct. 15-17
  • Miami: Programming 1 - Sept. 3-5 and Programming 2 - Oct. 15-17
  • Portland: Programming 1 - Sept. 23-25, Programming 2 - Oct. 15-17, Programming 3 - Nov. 4-6
  • Cleveland: Programming 1 - Sept. 3-5, Programming 2 - Oct. 15-17, Programming 3 - Nov. 12-14
  • San Jose: Programming 1 - Sept. 9-11, Programming 2 - Oct. 14-16, Programming 3 - Nov. 18-20

Get your tickets now for one of our stops. We even have best value bundles to help you save on training.

If you’re anything like me, you’ll get registered now – in case it sells out.


How to create a histogram in SAS University Edition

This is a simple tutorial showing how to use SQL to subset data, and then create a histogram using Proc Sgplot, in SAS University Edition.

So you've downloaded SAS University Edition, and you're wondering "What now?" -- I would recommend exploring some of the sample data, and creating some simple charts!

To explore the sample data, select 'Libraries' along the left side, and then expand the 'SASHELP' library. This will show you a list of all the sample data that is included with the SAS University Edition. Scroll through the list of datasets, and look for names that might interest you.


For this example, we'll be using the SASHELP.HEART dataset, which contains some heart-related data about several patients. Double-click SASHELP.HEART, and it will let you browse the data in a spreadsheet-like interface. Scroll left/right in the data, and notice there is a column for Sex, and columns for Diastolic and Systolic blood pressure (these are the values we'll be using in our graph).


We could easily plot all the data, but it is very useful to know how to plot just a subset. There are several ways to subset data in SAS, but I'm going to teach you how to do it with Proc SQL ... because SQL is a very versatile tool to use, and also because many of you might already be familiar with SQL if you've worked with databases.

The following SAS SQL code will create a new dataset called male_data, containing ... you guessed it! ... just the data for the males. Type this code into the CODE tab of the Program 1 window, and then click the Run button (icon of a little man running). Yeah, I know - I'm a meanie, making you type it in, rather than copy-n-paste -- but this is part of the learning process! :)
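
A minimal sketch of that subsetting step, assuming the Sex column in SASHELP.HEART stores the value 'Male' and that keeping every column (select *) is acceptable:

```sas
/* Keep only the rows for male patients (sketch; select * keeps all columns) */
proc sql;
create table male_data as select * from sashelp.heart where sex='Male';
quit; run;
```

Counting the semicolons here gives four: one each after proc sql, the create table statement, quit, and run.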


Did your code run smoothly? If not, check to make sure you have a matching single-quote on both sides of 'Male', and make sure you have all four semicolons! Once you've got it running smoothly, then you can add the following Proc Sgplot code to create a histogram:
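
One way such a histogram could be written is sketched here. It assumes the male_data data set from the SQL step, overlays the Diastolic and Systolic columns mentioned earlier, and uses a title and transparency setting that are illustrative choices only:

```sas
/* Overlay two semi-transparent histograms (title and options are illustrative) */
title "Blood pressure of male patients";
proc sgplot data=male_data;
histogram diastolic / transparency=0.5;
histogram systolic / transparency=0.5;
run;
```

The transparency option lets both distributions stay visible where they overlap.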


Double- and triple-check to make sure you have all the code typed in correctly, with all the quotes, slashes, and semicolons ... and then click the Run button. If you've done everything correctly, you should get a chart like the following:


So now you know how to view the sample data, manipulate it with SQL, and create a simple chart - you're well on your way to becoming a highly-paid SAS programmer! :)


Free SAS Software for students!

Remember the episode where Oprah gave a free car to everyone in her studio audience? - Well Jim Goodnight goes one better, and gives free SAS Software to all students in the world!

When I was in graduate school, I felt very fortunate to be at NC State University, because SAS let us use their software for free. I don't know how I could have done my data/graphics intensive research without it. And now SAS is making their software available (for free) for teaching, learning, and research in higher education all over the world, with the SAS University Edition!

SAS University Edition page


Here are the basic steps to install the software (do this once) ...

Download the free Oracle VirtualBox - this is the only thing that's really 'installed' on your computer.

Download the free SAS University Edition (this basically places a pre-installed copy of SAS in the VirtualBox environment).


Here are the basic steps to start up the software (do this each time you want to run a new SAS session) ...

Double-click the Oracle VM VirtualBox icon on your desktop.


You will then get a VirtualBox window, with SAS-University-Edition visible along the left side. Click the 'Start' button (green arrow).


You'll see this window as the SAS server starts in the background ...


And after about a minute (depending on the speed of your computer) you'll see the following VirtualBox window:




And here is how to use SAS - simply enter the following URL in a Web browser on your computer:



If you have run SAS in the past, you have probably used the Display Management System (DMS) as your user interface, which lets you edit and submit code, view your results, etc. The clever SAS developers have recently implemented a new interface called SAS Studio that is very much like DMS, but runs in a web browser. Here's what it looks like:


I plan to write several blog posts describing how to do various useful things in the University Edition, but I want to wrap up this blog with a simple graph, using sample data that is shipped with the software. Type the following into the CODE window, and then click the 'Run' button (picture of the little man running).


And you get the following graph:



If you've got friends that are college students and/or faculty, be the first to tell them about this great news, and you'll be their "SAS Hero" :)


Fraud is a social phenomenon

During the Analytics 2014 conference in Frankfurt, I had the chance to interview Professor Bart Baesens and PhD researcher Véronique Van Vlasselaer of KU Leuven.

Their presentation, “GOTCHA! Improving Fraud Detection Techniques Using Social Network Analytics” focused on how to prevent fraud before it happens.


I’ve posted many more of my interviews from the conference on the Inside Analytics playlist on YouTube.


How many Friday the 13th will we have this year?

Yes, I use SAS for everything - even determining how many more Friday the 13ths we have this year...


I woke up this morning (June 13, 2014) and realized it was Friday the 13th. We haven't had one in a while (not since last year, actually), and I got to wondering how many more we will have this year. I guess I could have flipped through a calendar and checked month-by-month, but no respectable programmer would do it that way! :)

So I wrote a little bit of SAS code that prints out every Friday the 13th this year ...

data lucky13; 
 do day = '01jan2014'd to '31dec2014'd by 1;  
  /* compare text to text: put() returns a character value */
  if trim(left(put(day,downame.)))='Friday' and put(day,day.)='13' then output;  
 end;
run;

proc print data=lucky13; 
format day date9.; 
run;

And here are the results ... just one Friday the 13th in year 2014! :)

Friday the 13th in year 2014
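
The same question can also be answered by testing only the 13th of each month instead of every day of the year. This is a sketch of an alternative, not the code used above; the data set name lucky13_by_month is made up for illustration. It relies on the WEEKDAY function, which returns 1 for Sunday through 7 for Saturday, so Friday is 6.

```sas
/* Alternative sketch: check only the 13th of each month in 2014 */
data lucky13_by_month;
 do month = 1 to 12;
  day = mdy(month, 13, 2014);
  if weekday(day) = 6 then output;  /* 6 = Friday */
 end;
 format day date9.;
 drop month;
run;

proc print data=lucky13_by_month;
run;
```

This loops 12 times rather than 365, and changing the year in the MDY call is all it takes to check another year.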

But, in addition to being a programmer, I'm also a "Graph Guy" ... therefore I always try to find a graphical way to visualize my results. And what better way than a custom SAS/GRAPH calendar chart! Click the snapshot below to see the full-size calendar, with hover-text showing the exact dates:

Friday the 13th Calendar


Four reasons I am stoked about SAS Professionals Convention

The 2014 SAS Professionals Convention, hosted by SAS UK and Ireland, will be June 24-26 at the beautiful headquarters in Marlow, just outside of London. SAS Marlow campus is surrounded by fairy-filled flowering gardens and there is a bluff overlooking the river Thames (where, by my honor, they have actual ice-cream BOATS; I am not joking, you must see them). But that is only one of the reasons I am excited to go. Here are the others…

SAS Professionals Connecting

This conference is a mix of academic and professional SAS users, but what they mostly share in common is that using SAS is a regular part of their work. These are people who have great ideas, suggestions, and tricks for making the most of their data. All of us will be invigorated by the ideas and tips we pick up from other users, and I plan to bring home tons of feedback to share with my colleagues here in Cary.

Sharing Ideas

I hope that this conference will also be a chance to invigorate others, as I’m giving three presentations during the conference. The first (June 24) is a discussion of how universities can prepare the next generation of business analysts. The second (June 25) addresses strategies to bridge the analytical talent gap. The third (June 26) is a discussion of some segmentation tips and tricks for the banking industry. I’ve spent plenty of time researching, interviewing, and culling past projects to put these talks together, and I’m excited to share them with attendees at the conference.


This one is completely selfish, I admit, but I’m just so excited that one of the other presenters is my friend Michelle Homes from Metacoda. She is such a delightful, vibrant, intelligent woman and I can’t wait to see her talk on creating dashboards in SAS Visual Analytics.

If you will be at the SAS Professionals convention, please introduce yourself! I’m excited to meet you and hear about how the conference is going for you. Give me tips and feedback to take back to SAS HQ.

And if you hear the jingle of the ice cream boats, I’ll race you to the shore with my fifty pence in hand!


Tracking sea turtles with SAS

SAS software can be used for many things - here's how you could use it to help save endangered sea turtles!

You've seen SAS used to track endangered wildlife by the shape of their footprints, and help run Texas Parks & Wildlife. Now, how about tracking the movements of (and thereby helping save) endangered animals such as sea turtles?!? To get you in the mood for such a blog, here's a photo that a friend of mine who is an avid scuba diver took on one of her many dive trips:


One of the latest tools to help study wildlife is to attach a transmitter to them, that broadcasts their GPS location. The data can then be collected, to see where the animal travels. The OCEARCH group uses this technique to study sharks, for example.

So I obtained some data for a sea turtle that was tracked right off the North Carolina coast, and plotted the points on a map. I added html hover-text to the markers showing the date/time/etc at each point, and added html drill-downs so you can click the markers to see a Google map centered on that location. I annotated arrows between the points, so you can easily see the order of the points. And I annotated the latitude/longitude gridlines, to provide a visual context of exactly where the points are on the map.

Here's a snapshot of my map - click on it to see the interactive version with the hover-text and drill-downs. It's interesting to see how the turtle made its way from north to south, and swam on both the sound-side and the ocean-side of the barrier islands.



Did you know you could do this with SAS? What other things might you track on a map like this? Here is a link to the SAS code, in case you'd like to re-use it to plot some of your own data!



Producing normal density plots with shading

When teaching statistics, it is often useful to produce a normal density plot with shading under the curve. For example, consider a one-sided hypothesis test. An alpha value of .05 would correspond to a Z-score cutoff of 1.645. This means that 95% of a standard normal curve falls below a value of 1.645. This also means that 5% of a standard normal curve falls above 1.645. So, how might we demonstrate these concepts graphically in SAS?
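
The numbers in this example can be checked directly in SAS before drawing anything. The following is a small sketch, not part of the original post; the variable names cutoff, below, and above are chosen for illustration. It uses the QUANTILE and CDF functions to recover the cutoff for a one-sided alpha of .05 and the areas on either side of 1.645.

```sas
/* Check the one-sided cutoff and tail areas for a standard normal */
data _null_;
  cutoff = quantile('Normal', 0.95);  /* z-score with 95% of the area below it */
  below  = cdf('Normal', 1.645);      /* area below the rounded cutoff */
  above  = 1 - below;                 /* area above the rounded cutoff */
  put cutoff= 8.4 below= 8.4 above= 8.4;
run;
```

The values are written to the log, where the cutoff should be approximately 1.645 and the two areas approximately .95 and .05.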

Graphing a normal curve without any shading is straightforward. To begin with, we create a data set containing the values for the x-axis, the values of the standard normal pdf, and a final variable set to zero.

data pdf;
  do x = -4 to 4 by .001;
    pdf = pdf("Normal", x, 0, 1);
    lower = 0;
    output;
  end;
run;

We next plot our data set using SGPLOT.

title 'Standard normal probability density function';
proc sgplot data=pdf noautolegend noborder;
  yaxis display=none;
  series x = x y = pdf / lineattrs = (color = black);
  series x = x y = lower / lineattrs = (color = black);
run;


The variable LOWER, which was set to 0, was included to show that the PDF values asymptote at zero for high or low values of X. This is not essential to the plot, but it adds a little extra clarity. I also changed a few other plot options (i.e., removing the legend, removing the border, removing the y-axis, and specifying the colors for the lines) to simplify the appearance of the plot.

Adding shading to a normal PDF plot requires a few extra steps. SGPLOT does not allow us to directly specify a shape to be shaded, but it does allow for shading between two lines, or bands, using the BAND statement. For example, we can shade the entire area under a standard normal PDF by adding a band between 0 and the PDF values:

title 'Standard normal probability density function with shading';
proc sgplot data=pdf noautolegend noborder;
  yaxis display=none;
  band x = x lower = lower upper = pdf / fillattrs=(color=gray8a);
  series x = x y = pdf / lineattrs = (color = black);
  series x = x y = lower / lineattrs = (color = black);
run;


We are not usually interested in shading the entire area under the curve, however. Instead, we are more likely to want to shade an area that is below or above some cutoff. For example, I previously mentioned that 95% of a standard normal curve falls below a value of 1.645. To demonstrate this, we can shade the area of a standard normal curve that falls below the cutoff of 1.645. Unfortunately, we cannot simply tell SGPLOT to only display the band below a particular value of X. Instead, for values of X that should not be shaded we can set the size of the band to zero, producing a line rather than a band. To do this, we add a new variable, UPPER, to the data set. Upper is equal to the standard normal PDF for values of X that we wish to be shaded, and zero otherwise.
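
A minimal sketch of that idea follows. It reuses the data step and plot from the earlier examples, and assumes the cutoff of 1.645 with shading below it; the title text is illustrative.

```sas
/* UPPER follows the PDF below the cutoff and drops to zero above it */
data pdf;
  do x = -4 to 4 by .001;
    pdf = pdf("Normal", x, 0, 1);
    lower = 0;
    if x <= 1.645 then upper = pdf;
    else upper = 0;
    output;
  end;
run;

title 'Standard normal PDF with the area below 1.645 shaded';
proc sgplot data=pdf noautolegend noborder;
  yaxis display=none;
  /* the band collapses to a line wherever upper = lower = 0 */
  band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
  series x = x y = pdf / lineattrs = (color = black);
  series x = x y = lower / lineattrs = (color = black);
run;
```

To shade the upper 5% tail instead, the condition could be flipped to x >= 1.645.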