If you want to become a 'data scientist', you should probably learn SAS/STAT. This blog post shows you the basics of running a statistical analysis in the free SAS University Edition.

In my previous blog posts, you learned how to install SAS University Edition and how to create some basic graphs in SAS. But to become a highly paid data scientist, you need to know how to do more than simply graph the data - you need analytics. And the SAS/STAT product is one of the best tools for performing statistical analyses. In this post I show you how easy it is to run data through a SAS/STAT procedure and produce some really impressive graphical visualizations of the results.

First we need some (fake) sample data. In my previous blogs I showed you how to use sample data that was included with SAS. This time I'll show you how to create your own (random) sample data from scratch. In the code below, a data step loops 1,000 times and writes one row of random data on each pass. Copy-n-paste the following into your CODE window, and run it (click the button with the little icon of a 'running man'):

    data fakedata;
       do i = 1 to 1000;
          /* three standard normal draws (seed 125) */
          z1 = rannor(125);
          z2 = rannor(125);
          z3 = rannor(125);
          /* x and y share z1, so they are correlated */
          x = 3*z1+z2;
          y = 3*z1+z3;
          output;
       end;
    run;
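If you'd like to confirm the data looks sensible before analyzing it, here is a minimal sketch of an optional check. It is not part of the original steps; it uses the Base SAS procedures PROC MEANS and PROC CORR, and assumes the `fakedata` data set was created in your WORK library by the data step above. Because x and y each equal 3*z1 plus an independent standard normal term, each has a theoretical variance of 10 and their covariance is 9, so the sample correlation should land near 0.9.

```sas
/* Optional sanity check (an illustration, not from the original post). */
/* Assumes WORK.FAKEDATA exists from the data step above.               */

/* Basic summary statistics for x and y:                                */
/* expect means near 0 and standard deviations near sqrt(10).           */
proc means data=fakedata n mean std min max;
   var x y;
run;

/* Pearson correlation between x and y:                                 */
/* in theory 9/10 = 0.9, so the sample value should be close to that.   */
proc corr data=fakedata;
   var x y;
run;
```

A strong positive correlation here is what you should expect to see reflected in the density plots.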

Once you have successfully run the code and created the random sample data, you can use Proc KDE to analyze it and generate some impressive graphics (the KDE procedure performs kernel density estimation; the BIVAR statement used here requests a bivariate estimate for x and y).

    proc kde data=fakedata;
       /* a list of several plot requests must be in parentheses */
       bivar x y / plots=(contour contourscatter histogram surface);
    run;

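And if you've done everything correctly, you'll get the four requested plots: a contour plot, a contour plot overlaid with the scatter points, a bivariate histogram, and a surface plot of the estimated density.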

But let me leave you with a **stern warning** ... Please don't just blindly run the SAS/STAT procedures without understanding what they do. For each statistical analysis you perform, you need to understand the assumptions and requirements for the data, and have a good basic knowledge of what the analysis is doing. Just because a SAS statistical procedure runs against your data without producing any 'ERROR' messages does not mean the analysis was valid for that particular data.