Jedi SAS Tricks: Warp Speed DATA Steps with DS2

I remember the first time I was faced with the challenge of parallelizing a DATA step process. It was 2001 and SAS V8.1 was shiny and new. We were processing very large data sets, and the computations performed on each record were quite complex. The processing was crawling along on impulse power and I felt the need - the need for warp speed!

From the SAS log we could see that elapsed time was almost exactly equal to CPU time, so we surmised that the process was CPU bound. With SAS/CONNECT licensed on our well-provisioned UNIX SAS server and an amazing SUGI paper extolling the virtues of parallel processing with MPCONNECT in hand, we set out to chart a course in this brave, new world. The concept behind MPCONNECT is to write a SAS control program that breaks your data up into smaller pieces, spawns several identical DATA step jobs to process the pieces in parallel, monitors progress until they all finish, then reassembles the individual outputs to obtain the final results. Labor intensive, for sure, but it definitely accelerated processing of CPU-bound jobs.

But now I have SAS 9.4 with the new DS2 programming language. DS2 was built from the ground up with threading in mind - and suddenly parallel processing with the DATA step just became a whole lot easier! For example, here is a (senseless, I'll admit) CPU-intensive Base SAS DATA step program:

data t1;
   array score[0:100];
   set t END=LAST;
   do i=LBOUND(SCORE) to hbound(score);
      Score[i]= (SQRT(((id * ru * rn) / (id + rn + ru))*ID))*
                (SQRT(((id * ru * rn) / (id + rn + ru))*ID));
   end;
   count+1;
   if last then put 'Data step processed ' count 'observations.';
   drop i count;
run;
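The DATA step above assumes an input table T containing numeric variables ID, RU and RN. If you want to run it yourself, here's a minimal sketch for generating some test data - the row count and random values are arbitrary, not part of the original example:

/* Hypothetical test data: table T with numeric variables ID, RU and RN */
data t;
   call streaminit(12345);
   do id=1 to 1000000;
      ru=rand('uniform');
      rn=rand('normal');
      output;
   end;
run;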

When executed, this process consumes about the same amount of CPU time as elapsed time:

NOTE: DATA statement used (Total process time):
      real time           5.20 seconds
      cpu time            5.11 seconds
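By the way, if you want more detail than the default real/CPU note, the standard FULLSTIMER system option adds memory and other statistics to the log - handy when you're trying to decide whether a step is CPU bound or I/O bound:

options fullstimer;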

I suspect the process is CPU bound and could benefit from threading. First, I’ll try this as a straight DS2 DATA step:

proc ds2;
data t2/overwrite=yes;
   dcl bigint count;
   drop count;
   vararray double score[0:100] score0-score100;
   method run();
      dcl int i;
      set t;
      do i=LBOUND(SCORE) to hbound(score);
         Score[i]= (SQRT(((id * ru * rn) / (id + rn + ru))*ID))*
                   (SQRT(((id * ru * rn) / (id + rn + ru))*ID));
      end;
      count+1;
   end;
   method term();
      put 'DS2 Data step processed' count 'observations.';
   end;
enddata;
run;
quit;

This process is still running single-threaded, and uses about the same resources and elapsed time as the original, with a little extra (as expected) for the PROC overhead:

NOTE: PROCEDURE DS2 used (Total process time):
      real time           5.98 seconds
      cpu time            5.86 seconds

Now, let’s convert the process to a thread. First we create the THREAD program, which will be stored in a SAS library. I’m going to store it in WORK in this case. To convert the DS2 DATA step to a THREAD step, I'll simply change the DATA statement to a THREAD statement and the ENDDATA statement to ENDTHREAD:

proc ds2;
thread th2/overwrite=yes;
   dcl bigint count;
   drop count;
   vararray double score[0:100] score0-score100;
   method run();
      dcl int i;
      set t;
      do i=LBOUND(SCORE) to hbound(score);
         Score[i]= (SQRT(((id * ru * rn) / (id + rn + ru))*ID))*
                   (SQRT(((id * ru * rn) / (id + rn + ru))*ID));
      end;
      count+1;
   end;
   method term();
      /*Make each thread report how many obs processed*/
      put 'Thread' _threadid_ ' processed' count 'observations.';
   end;
endthread;
run;
quit;

Executing that program creates the thread and stores it in the WORK library in a dataset named th2. Now to write a short DATA step program that executes 4 instances of the thread in parallel:

proc ds2;
/*Multi-threaded*/
data th4/overwrite=yes;
   dcl thread th2 t;
   method run();
   set from t threads=4;
   end;
enddata;
run;
quit;

And the clock time is significantly reduced, at the expense of some extra total CPU time. Note that the CPU time is now longer than the elapsed time, indicating that operations were conducted in parallel - about 9.2 seconds of CPU work was completed in 3.2 seconds of wall-clock time, so roughly three threads' worth of work was in flight at any given moment. The PUT statement in the thread's TERM method reports how many observations each thread processed.

Thread 3  processed 281152 observations.
Thread 2  processed 219648 observations.
Thread 1  processed 294528 observations.
Thread 0  processed 204672 observations.
NOTE: PROCEDURE DS2 used (Total process time):
      real time           3.20 seconds
      cpu time            9.20 seconds

Our threaded process cut the elapsed time almost in half!
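The thread count doesn't have to be hard-coded, either. One variation (a sketch, not the code I actually ran) is to read the CPUCOUNT system option into a macro variable and pass that to the SET FROM statement:

/* Sketch: size the thread count to the host's CPU count.            */
/* CPUCOUNT is a standard SAS system option; TH_AUTO is just a new   */
/* output table name for this example.                               */
%let ncpu=%sysfunc(getoption(cpucount));

proc ds2;
data th_auto/overwrite=yes;
   dcl thread th2 t;
   method run();
      set from t threads=&ncpu;
   end;
enddata;
run;
quit;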

That's all I have for this time. As usual, you can download a ZIP file containing a copy of this blog entry and the code used to create it from this link.

Now I'm off to participate in SAS Global Forum 2015 in Dallas. There are tons of presentations that talk about DS2, SAS in-database processing and using SAS with Hadoop. Look me up! I can be found at the #SASGF15 #TweetUp Saturday night, attending various presentations (especially about DS2 and Hadoop), or hanging out in the Quad on Tuesday afternoon from 2 to 2:30 pm to answer your questions about SAS Foundation programming or DS2. I'm also teaching the post-conference DS2 Programming Essentials class at the conference center. So, I hope to see you there.

Until next time, may the SAS be with you!
Mark


Technical experts on hand at SAS Global Forum

The SAS Training and Certification groups are excited to participate in SAS Global Forum 2015! We'll have a booth in the Quad where you can stop by to ask questions, talk to your favorite instructor and register to win an iPad! We offer courses on almost every SAS product, so to make things easier on you, we've put together a schedule of when experts are available in each topic area.


Do you have a question about certification? SAS Global Certification manager Terry Barham will be giving an overview of the SAS Certification program on Monday, April 27 at 12:30 p.m. He'll also be in the certification booth in the Quad during the conference. Our certification program was recently recognized by Certification Magazine as being the "sweet spot" for certification in big data.

There is also an eLearning booth where you can sit down and experience SAS eCourses for yourself. Nine eCourses will be available on laptops and iPads. Topics include SAS Enterprise Guide, Programming, SAS Macro Language, Predictive Modeling, JMP Software, Credit Risk Modeling and SAS Certification practice exams.

If you’re not attending SAS Global Forum, we’re always available to answer your questions about SAS training and certification. Contact us at training@sas.com.


More reasons to stop smoking!

Smoking is an addictive habit that can kill you - if you don't believe me, check out the infographic in this blog post.

Recently a friend of mine was on the episode of the Dr. Phil show that focused on "quitting smoking." Here's a picture of Traci with Dr. Phil ...

traci_with_dr_phil

Being a non-smoker myself, and seeing very little smoking among my co-workers (smoking isn't allowed on the SAS campus), I hadn't really given much thought to the dangers of smoking. But when my friend mentioned that she was quitting, I did a few web searches on the topic and the statistics are indeed quite scary. I found an infographic on the Centers for Disease Control and Prevention's website, and decided to try to reproduce it with SAS software.

I used the same technique that I demonstrated in the art & analytics blog a few weeks ago: I created the custom donut pie chart using annotate functions, and then annotated colored polygons (using a slightly lighter shade of the pie slice colors) extending out to the side edges of the graph area. I then annotated the text & numbers on the graph. If you click the graph below, you can see the interactive version with hover-text and drilldowns on the donut pie slices.

smoking_deaths
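I won't repeat all of the annotate code here, but to give you the flavor, here's a minimal sketch of the annotate PIE function that draws the wedges - the coordinates, angles and colors below are placeholders, not the values from my actual graph:

/* Sketch only: the PIE function draws filled wedges, and a smaller      */
/* white pie drawn on top turns the pie into a donut.                    */
data anno_donut;
   length function style color $ 8;
   xsys='3'; ysys='3'; hsys='3'; when='a';
   function='pie'; x=50; y=50; style='psolid';
   size=20; angle=0;   rotate=120; color='cxc00000'; output;  /* wedge 1    */
   size=20; angle=120; rotate=240; color='cx00aeef'; output;  /* wedge 2    */
   size=10; angle=0;   rotate=360; color='white';    output;  /* donut hole */
run;

proc ganno annotate=anno_donut;
run;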

Best of luck to Traci, and anyone else out there trying to stop smoking!


5 questions with analytics expert Bart Baesens


Bart Baesens

If anyone knows how to finesse insight out of data, it’s Bart Baesens, professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom).

Not only has he written a book about it, Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, but he also teaches a number of Business Knowledge Series courses, including Advanced Analytics in a Big Data World.

And in his spare time (because I don’t think he requires sleep) he tutors, advises and provides consulting support to international firms with respect to their analytics and credit risk management strategy.

Despite his busy schedule, he’s always available to answer my questions about the latest in analytics. So here they are – I kept it to just five for him.

  1. What is your advice for organizations trying to implement the latest trends such as mass customization, personalization, Web 2.0, one-to-one marketing, risk management, and fraud detection?

In a nutshell, it would be: invest in data and analytics!  The applications you mention all require data, typically collected across a diversity of channels (e.g. on-line, off-line, mobile, web, email, etc.).  The data collected provides a unique and comprehensive perspective on a customer's behavior and engagement.  By using analytics, organizations can get a clear picture of this, which will allow them to gain competitive leverage and explore new strategic opportunities.  Obviously, it is of key importance that the data is of good quality. That's why firms are investing more and more in data governance initiatives.

  2. What's the biggest mistake organizations are making when trying to implement big data strategies? And how can they fix it?

Well, actually, there are a few if you ask me.  First of all, big data and analytics should be embedded into a firm's DNA.  In other words, it should be supported by all decision levels in the company, from operational to tactical and strategic.  That's why it's of crucial importance to set up the necessary corporate governance initiatives in terms of organizational impact, logistics and support (both hardware and software), and of course: education and training!  Furthermore, big data & analytics is not magic, so make sure to set your expectations appropriately at the outset of the project.  Finally, there are still business settings where data is only available in small quantities.  Just think about new or very specific products, for example.  In those settings, it is important to optimally combine the (often tacit) business knowledge with the limited data available, using specialized (e.g. Bayesian network) techniques.

  3. How do you see emerging data science techniques changing business processes in the future?

I think there will be multiple effects.  First of all, thanks to data science, the performance and efficiency of business processes will improve, which will result in cost savings and/or value creation.  A next effect will be regulatory compliance.  Given the impact of analytics, which is now bigger than ever before, we see more and more regulatory guidelines being introduced for developing analytical models.  Just think about the Basel and Solvency accords in risk management, for example.  Another popular example concerns privacy regulation.  Data science techniques will help ensure that business processes are regulatory compliant.  Last but not least, data science can provide better transparency into business processes by providing new insights into customer behavior.  Think about fraud detection, for example, where data science can uncover new fraud mechanisms which can then in turn be used to develop better fraud prevention business processes.

  4. You recently developed a course, Advanced Analytics in a Big Data World. What real-world skills can students pick up in the class?

In the first lesson, I start by zooming in on the analytical process model and discuss the key characteristics of an analytical model: accuracy, interpretability, operational efficiency, economic cost, and regulatory compliance.  This is followed by a discussion of how state-of-the-art analytical techniques can be used to develop analytical models satisfying these characteristics in settings such as credit risk modeling, fraud detection, churn prediction, customer segmentation, customer lifetime value modeling, etc.  Techniques discussed are: decision trees and ensemble methods (bagging, boosting and random forests), neural networks, support vector machines, Bayesian networks, survival analysis and social networks.  The course concludes by discussing how to monitor and backtest analytical models.  It includes lots of real-life examples and case studies across diverse settings.  I also extensively report on my recent research findings and industry consulting experience.

  5. Any concluding advice you have for aspiring data scientists?

Yes, sure!  The world is changing at a faster pace than ever before.  Just think about the Internet of Things, drones, self-driving cars, etc.  I believe we are only at the start of the data avalanche.  To stay ahead of the competition, it is of key importance to continuously educate yourself, understand new technologies and see how they can create added business value.  Knowledge is power, remember! I hope my new e-learning course can help shape the next generation of data scientists!

My last interview with Bart Baesens at the Analytics 2014 conference in Frankfurt.

Here's a photo from another interview I conducted with Bart Baesens at the Analytics 2014 conference in Frankfurt.


Analyzing wait times at VA health care facilities

Data about the monthly wait times at VA facilities in the US are now available, but it's a bit overwhelming to try to analyze them in tabular form - plotting the data on a map made it a lot easier!...

Here in the US, when our soldiers finish their commitment in the military (retire, or are honorably discharged), they are allowed to utilize the VA health care facilities. But the VA facilities have been under a lot of scrutiny lately - in particular for long wait times.

A recent article in our local news mentioned that the worst VA wait times are in the South. The article mentioned several specific examples, but being a data person, I wanted to see the actual data. I looked around a bit and found the actual data for February 2015. Here's a screen-capture of a portion of the table:

va_table_cap

Unfortunately the data are in a table in a pdf file, which makes it quite cumbersome to work with. I ended up copying and pasting it one line at a time into a simple text file I could import into SAS. I got all major rows for each group of facilities (rather than trying to get each individual facility). I then used Proc Geocode to estimate a lat/long for each facility, and annotated them as markers on a map, color-coded based on the number of appointments completed in under 30 days. (Click the map below to see the interactive version, with html hover-text for each marker.)
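If you haven't used it before, Proc Geocode is pretty simple. Here's a minimal sketch of one way to do it - it assumes the imported table has a variable named ZIP, and uses METHOD=ZIP, which looks up the ZIP-code centroid in the SASHELP.ZIPCODE data set that ships with SAS (Proc Geocode supports several other methods as well):

/* Sketch: assumes the imported table (here called va_data) has a ZIP     */
/* variable; adds X (longitude) and Y (latitude) to the output data set.  */
proc geocode data=va_data out=va_data_geo method=zip;
run;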

va_hospital_wait_times_feb_2015

At this level of aggregation, it does appear that the South might be doing a bit worse than the Northeast, and my state (North Carolina) has some red, orange, and yellow markers (which will hopefully be improving). But rather than trying to compare all the facilities across the nation, I liked that the map allowed me to see where the facilities are located, and hover over them to see their data.

My next step would be to plot all the individual facilities (instead of the aggregate data) - and it would be *great* to find a more convenient version of the data (maybe a spreadsheet or csv file). If anybody knows of a better data source, let me know (hint, hint!).

And to close this blog post, here's a picture of my friend Trena's husband, proudly serving his country - hopefully by the time he's out of the military, we'll have all the facilities running like well-oiled machines, with short wait times and good service!

soldier


A custom map to help track the flu

Has this year's flu been better or worse than you thought it would be?

There are a lot of factors that help determine whether or not you're likely to get the flu. Is there a bad strain going around? Did the flu vaccine target the right strain? Did you get the flu shot? Has the weather been cold & wet? Has your health been poor in general? Have you had to care for family members who had the flu? Etc, etc, etc.

And I guess a lot of flu-factors get rolled into geography - if the flu is "going around" in your area, then you're probably more likely to get it. Which is why I was happy to find the CDC's flu map! It shows all the US states (and a few other areas) color-coded by the prevalence of the flu! Here's a screen-capture of their flu map:

cdc_influenza_orig

Of course, any time I see a nice map, I naturally want to try to create it in SAS. The CDC map had only 2 challenging aspects that I didn't know the exact code for, right off the top of my head. The first was the cross-hatch patterns - I knew SAS/Graph could do them, but I didn't know the exact syntax. After a quick visit to the pattern statement help page, I determined that the 2 special map patterns could be coded as m4x45 and m4n90. The second challenge was including the territories (such as Guam, US Virgin Islands, and Puerto Rico) in the US map. I decided to subset them out of the world map, re-size & re-scale the x/y coordinates, and then combine them with the US map. Here's a link if you'd like to see the exact SAS code that was used.
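If you're curious, here's roughly what those pattern statements look like (only the two map-pattern values, m4x45 and m4n90, come from my map - the colors and the solid patterns below are just placeholders):

/* Map pattern values: 'm' + density(1-5) + n|x (lines or cross-hatch) + angle */
pattern1 value=msolid color=cxcc0000;   /* solid fill (placeholder color)  */
pattern2 value=msolid color=cxff9933;   /* solid fill (placeholder color)  */
pattern3 value=m4x45  color=gray;       /* cross-hatched at 45 degrees     */
pattern4 value=m4n90  color=gray;       /* parallel lines at 90 degrees    */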

The results came out looking very close to the original (see below). And one extra bonus feature of my map is that I added html hover-text for each state - this can be helpful to anyone who is analyzing the data, but in particular allows vision-impaired people to explore the map using voice-over technology (as they hover over each state, the state name and flu prevalence are read out loud). Click the map snapshot below, to see the interactive version with hover-text.


Landing a SAS Certification


Lauren Guevara

After working as a flight attendant for more than 20 years, Lauren Guevara was ready for a new adventure.

The inspiration for her journey came from an article she read in CNN’s Money magazine that highlighted the earning potential of a SAS Certification. Also having earned a Master of Science in e-commerce years earlier, she naturally gravitated toward the computer industry.

“My mom was the one who encouraged me to read Money magazine,” said Guevara. “The article mentioned career advancements you can make by becoming a data miner and getting certified in SAS.”

After reading the article Guevara started researching SAS online and also purchased the book, Learning SAS by Example: A Programmer’s Guide. That book started traveling the world with her. She devoted her downtime during layovers and breaks to reading. What she learned led her to a unique decision: become a SAS programmer.

Her first step was signing up for online e-learning courses in SAS Programming 1 and Programming 2. “I worked through both e-lessons and tried to learn everything before setting foot in a classroom,” said Guevara.

Eventually she felt ready for the classroom and attended SAS Programming 1 in the Charlotte, NC training center. The classroom training reinforced what she was introduced to in the e-courses and gave her an opportunity to ask more detailed questions. “Coming into this with no experience, classroom and e-learning together was the best way for me to learn it,” said Guevara. “I did a lot of fine tuning in the classroom.”

Guevara wanted to earn the SAS Certified Base Programmer Credential as a way to boost her credibility to potential employers.

“I noticed in the classroom that everyone had computer jobs or worked in the industry,” said Guevara. “Since I didn’t have that same experience, I felt it was necessary to have the credentials to back up my skills. SAS certifications are respected in the industry.”

Guevara purchased the base programming certification package offered by SAS, which included a training course, prep exam and certification exam voucher at a discounted price to help her prepare.

Another study tip she shared was reading the SAS Certification Prep Guide: Base Programming for SAS 9.

Guevara was a bit embarrassed to share that she didn’t pass the exam on her first attempt. However, she realized that it might be inspiring for others to know that it’s possible to fail and still achieve your goals. “The first time I took the exam, I wasn’t ready,” said Guevara, “but I wasn’t giving up. I went back and really started to understand the language better. You really have to know this stuff. It’s hard, but it’s possible.”

With her relentless determination, Guevara passed the base programmer exam and is working to earn the SAS Certified Advanced Programmer Credential by the end of the year. In the meantime, she’s going to attend the annual PharmaSUG event in Orlando to network with other SAS programmers.

Guevara eventually sees herself doing part-time contract work as a programmer, while still flying part time for the airline.

Who knew some simple motherly advice would lead Guevara on this life-changing path? Mom, of course! And she couldn’t be prouder of her daughter. “She sang a song when I finally passed the exam. She’s so happy.”

Learn more and start your SAS Certification journey.


Applying the KISS principle to maps: An analysis of breastfeeding prevalence

How simple is too simple, when it comes to analyzing data on a map?

The KISS principle can be applied to many things, including graphs and maps. What is the KISS principle, you might ask? Well, it's not the rock band that my friend Patricia (pictured below) has been known to dress like. Instead, it is the principle of "Keep it Simple" (or one of the several variations of the wording). I think KISS is a good goal in general, but should it be applied to the actual geometry of a map? Let's experiment and find out...

patricia_kiss

I recently saw a map on dadaviz that represents each state in the US as an equal-sized colored square. I thought it was an interesting approach (as it helps eliminate area size bias), and therefore I wanted to see if I could create a similar map with SAS. But as I was creating this simplified map, I noticed that many of the states were not in their proper position relative to other states (for example, Virginia was to the west of North Carolina instead of to the north, and South Dakota was to the east of North Dakota instead of to the south). And the states were also difficult to recognize, without their familiar shapes. Well, anyway, here's my SAS version of their map:

us_breastfeeding_2011

So, although their simplified map design is interesting, perhaps it takes KISS a bit too far? I wondered if it might be better to use a slightly less simplified map, such as the ones promoted by Mark Monmonier in his books How to Lie With Maps and Mapping It Out. So I created a custom US map where the states are shaped like Mark's map, and plotted the same data on it. In this map, the states are in the correct relative position and roughly the correct size, and are therefore much easier to recognize than the squares in the previous map. But I still find it a bit difficult to recognize some of the states, and I wonder: what is the benefit of the simplified shapes?

us_breastfeeding_20111

Finally, I plotted the data on a traditional US map. And personally, I prefer this one over the two simplified versions.

us_breastfeeding_20112


So, what's your opinion - which map do you prefer? What are the pros & cons of each map?



UK General Election 2015: using PROC MAPIMPORT to visualise the election

Election fever has hit the United Kingdom as the days count down to 7th May 2015.  This is likely to be one of the most uncertain elections in recent memory, with nearly 10 parties struggling for votes across England, Scotland, Wales and Northern Ireland.  Results night will be tense, with the different TV channels competing for the most engaging visualisation and graphics. Gone are the days of the simple 'swingometer', which showed the swing between the two traditionally dominant parties, Conservative and Labour.

In my earlier blog, I looked at ways analytics could be used to forecast results.  But what is the best way to display them?  My esteemed colleague, Robert Allison, is working on how best to do this and will share his results in his forthcoming blog (stay tuned).  However, for a starter for 10, here is how you could produce a map using SAS.

Luckily, the Ordnance Survey provide open source data for electoral boundaries in the UK, in the form of 'shape' (SHP) and attribute (DBF) files.  You can download it here.   It's a simple matter for SAS to read in this data.

PROC IMPORT out=work.westminster_const_region
            datafile=" … westminster_const_region.dbf"
            dbms=dbf replace;
   getdeleted=no;
run;

PROC MAPIMPORT datafile=" … westminster_const_region.shp"
               out=work.westminster_const_region_map contents;
   id polygon_id;
run;

Green Party 2010, London

You can combine this with open source results data available on sites including Electoral Calculus to plot results to your heart's content.  I created a dataset called 'combined' and plotted the Green Party's results in 2010 in the London region.  In this 'choropleth' map, the greener the area, the more votes the Green Party got in 2010.  To do this, I had to create a 'colour ramp' ranging from very green to white using PROC TEMPLATE.

PROC TEMPLATE;
   define style styles.green;
      parent=styles.default;
      style twocolorramp / startcolor=white endcolor=green;
      style graphdata1 from graphdata1 / color=white;
      style graphdata2 from graphdata2 / color=green;
   end;
run;

Finally, I can plot the results using PROC GMAP.

ods html style=styles.green;

PROC GMAP map=westminster_const_region_map data=combined;
   id polygon_id;
   choro grn;
   where region='London';
run;
quit;
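For completeness, here's one way the 'combined' dataset mentioned above might be built. Note that the results table, its variable names (constituency, grn, region) and the join key are assumptions for illustration - substitute whatever the downloaded results data actually contains:

/* Sketch only: join election results to the imported attribute table.   */
/* Assumes the DBF provides NAME and POLYGON_ID, and that a results      */
/* table (work.results_2010) holds one row per constituency.             */
PROC SQL;
   create table combined as
   select m.polygon_id, r.grn, r.region
   from work.westminster_const_region as m
        inner join work.results_2010 as r
        on m.name = r.constituency;
quit;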

In the meantime, if you’re keen to find out more about government data and how analytics is shaping the future of our political thinking, check out our research with Civil Service World on Big Data in the public sector.


Variations on a stickman graph: Analyzing the Twitter minions

One of our customers asked if I could show him how to reproduce a stickman graph that David McCandless (of Information is Beautiful fame) had created. I'm not sure that it's the best kind of graph for the occasion, but of course SAS can be used to create it! ...

David's graph uses 100 stickmen to represent all the Twitter users, and divides them into 5 categories. Each category is represented by a color. In the SAS dataset, I represent each stickman by an X and Y pair (for the position on the grid), and Color_value (1-5, for the 5 color categories), using the following code:

data my_data;
retain x y;
input color_value @@;
/* lay the stickmen out on a 20-wide by 5-high grid, starting at the top row (y=5) */
if x=. then y=5;
x+1;
if x=21 then do;
   x=1;
   y=y-1;
   end;
datalines;
1 1 1 1 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 5
1 1 1 1 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5
1 1 1 1 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 
1 1 1 1 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 
1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 5
;
run;

I create a user-defined format so that the numeric Color_values (1-5) print in the graph legend as the desired text descriptions, and then plot the points with SAS/Graph Proc Gplot using plot y*x=color_value, and the following symbol statements (the '80'x character of the Webdings font is the stickman figure).

symbol1 font='webdings' value='80'x height=15 color=cxec008c;
symbol2 font='webdings' value='80'x height=15 color=cx8cc63f;
symbol3 font='webdings' value='80'x height=15 color=cx662d91;
symbol4 font='webdings' value='80'x height=15 color=cx00aeef;
symbol5 font='webdings' value='80'x height=15 color=cx939598;
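Here's a minimal sketch of that user-defined format and the gplot call (the category labels are placeholders - substitute the five descriptions from McCandless' graph):

proc format;
   value catfmt 1='Category 1'   /* placeholder labels - use the real */
                2='Category 2'   /* category descriptions here        */
                3='Category 3'
                4='Category 4'
                5='Category 5';
run;

proc gplot data=my_data;
   format color_value catfmt.;
   plot y*x=color_value;
run;
quit;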

The graph came out like this, which is almost what we want:

twitter_as_100_people

I then used axis statements to suppress the axis tick marks, numeric values, and lines (axis1 label=none style=0 value=none major=none minor=none), and now have the following plot, which is a much 'cleaner' version:

twitter_as_100_people1

At this point, I would have "called it a day" and been done. But McCandless' version was a little more "politically correct" and had both stickmen and stickwomen ... which makes creating the graph a bit more complex. The technique I was using only allows you to have 1 color and marker shape per category. Therefore I changed techniques, and used gplot to create a blank graph, and then used annotate to programmatically draw the stickpeople (annotate always gives you total control).

data anno_markers; set my_data;
length function color $8;
xsys='2'; ysys='2'; hsys='3'; when='a';
function='label'; style='webdings'; position='+'; size=15;
color='black';
if color_value=1 then color='cxec008c';
if color_value=2 then color='cx8cc63f';
if color_value=3 then color='cx662d91';
if color_value=4 then color='cx00aeef';
if color_value=5 then color='cx939598';
/* make even x-values the female stick-figure, and odd ones the male */
if mod(x,2)=0 then text='80'x;
else text='81'x;
run;

Add to that a few carefully placed (annotated) text labels that explain what the colors mean, and we have a graph very much like McCandless' beautiful version:

twitter_as_100_people2

What are the pros & cons of these stickman graphs, and what other graph might better represent this data?
