Easter forecasting challenges the Easter Bunny brings

The date of Easter influences our leisure activities

Unlike many other public holidays, Easter is a so-called movable holiday. This means that the Easter Bunny brings the statistician more than just eggs: he brings special Easter forecasting challenges. In the year 325 CE, the Council of Nicaea determined that Easter would fall on the Sunday after the first full moon of spring. The earliest possible date for Easter is thus March 22nd, the latest April 25th. Carl Friedrich Gauss took this rule and developed his Easter algorithm, which computes the date of Easter for any given year.
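Gauss's rule is compact enough to sketch in a few lines of Python. The version below implements the Gregorian form of the algorithm, including its two classical exception cases:

```python
def gauss_easter(year):
    """Date of Easter Sunday (Gregorian calendar) via Gauss's algorithm.

    Returns (month, day), where month is 3 for March or 4 for April.
    """
    a, b, c = year % 19, year % 4, year % 7
    k = year // 100
    p = (13 + 8 * k) // 25
    q = k // 4
    M = (15 - p + k - q) % 30
    N = (4 + k - q) % 7
    d = (19 * a + M) % 30                 # days until the Paschal full moon
    e = (2 * b + 4 * c + 6 * d + N) % 7   # days from the full moon to Sunday
    day = 22 + d + e                      # counted from March 1

    # Two historical exceptions to the basic rule:
    if d == 29 and e == 6:
        return (4, 19)
    if d == 28 and e == 6 and (11 * M + 11) % 30 < 19:
        return (4, 18)
    return (3, day) if day <= 31 else (4, day - 31)
```

As a check, `gauss_easter(2016)` returns March 27th and `gauss_easter(2017)` returns April 16th, matching the dates cited later in this post.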
The moving date of Easter matters, because this holiday has a very strong influence on our leisure activities. Many people use Easter to take a spring holiday, which in turn has a strong impact on the hotel, restaurant, and transportation industries. Our choice of activities also depends on the timing: many of my fellow Europeans plan a ski vacation when Easter falls at the end of March, while an April date calls instead for enjoying the warming spring weather rather than seeking the snow.

Analysis of the data mirrors that behavior

Monthly time series data reflects this behavior: over multiple years of observations, you often see the peak switch between March and April.

  • When Easter falls in March, winter sports resorts in the Austrian Alps often have full hotels that month. In April, however, occupancy drops very sharply, because Easter often also marks the end of the season for the ski lifts.
  • The importance of the Easter holidays can also be observed in the number of air and rail travelers or of visitors to recreational facilities.

The following graph shows the number of visitors at a leisure park from 1993 to 2000. To make the pattern easier to see, only the months of March (blue) and April (red) are shown. In addition, a green step curve shows whether Easter falls in March or April in each year.

It is easy to see that when the holiday starts in March, the blue line is higher than the red line; when it starts in April, the reverse is true. The behavior of consumers and tourists shows up clearly in our analysis of the data.

When evaluating results, it is important to take this fact into account. From a sales perspective it does not make sense to panic when March sales fall behind those of the previous year if the “Easter peak” is expected only in April.

Easter forecasting challenges

A peak that varies between March and April is a problem for time series forecasting models. Many simple models use only the historical values of the time series: if Easter falls in April in two consecutive years, such a model expects the next peak in April again, because it has learned only the seasonal effect. These are the Easter forecasting challenges.
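To make the failure mode concrete, here is a toy sketch with made-up visitor numbers (not the leisure-park data above): a purely seasonal model issues the same March forecast every year, while a model that conditions on an Easter-in-March indicator tracks the switching peak.

```python
import numpy as np

# Hypothetical March visitor counts for eight years; the peak lands in
# March only in years when Easter falls in March (flags are illustrative).
easter_in_march = np.array([False, True, False, False, True, False, False, False])
march = np.where(easter_in_march, 120.0, 80.0)

# A purely seasonal model learns one average per calendar month and is
# blind to Easter's timing: it issues the same March forecast every year.
seasonal_pred_march = march.mean()   # 90.0, too low in Easter-in-March years

# An Easter-aware model conditions on where Easter falls that year.
aware_pred = {flag: march[easter_in_march == flag].mean() for flag in (True, False)}
# aware_pred[True] -> 120.0 (Easter in March), aware_pred[False] -> 80.0
```

The seasonal model splits the difference every year; the Easter indicator recovers the true level in both regimes.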

This fact is important because there is an increasing trend for planning software packages to offer only a few simple standard methods of time series forecasting. For some applications a simple seasonal exponential smoothing model does quite well, but it takes a little more effort to handle situations like the one above. You do not want to manually correct your budget figures for the “Easter effect” every time, and it is tedious to keep explaining your own numbers in a planning meeting with the caveat, “Yes, but this year....”

Events, inputs and flexible time intervals make a difference

The forecasting solution from SAS is built to consider these points for you:

  • Automatic model selection checks for you whether the time series can be forecast with a simple model or whether a more powerful model should be used.
  • Your individual data are analyzed to select the most appropriate model from a model library and fit it to your data.
  • The forecast considers input variables like the number of workdays, Saturdays, or Sundays, as well as already-booked orders.
  • Pre-defined calendar events like Easter or Christmas can be considered in the analysis.
  • You can define individual events that are important for your company or organization that influence the course of your key performance indicators.
  • You can adapt the length of the time intervals to the seasonal pattern of your processes.

Next year, Easter takes place on March 27th; in 2017 it will fall on April 16th. Check now to see whether your forecasting software and the resulting forecast numbers consider this shift.

SAS Global Forum, another moveable date

Another event that has a moving date is SAS Global Forum. This year it takes place April 26th-29th in Dallas, Texas.
My paper, “Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows You How,” is scheduled for Wednesday, April 29th, 10:00-10:50 a.m. (Preview the A2014 Demo Version). I look forward to meeting you there!


The unicorn quest continues – more tips on finding data scientists (part 3)

Because finding analytical talent continues to be a challenge for most organizations, here I offer tips 5, 6, and 7 of my ten tips for finding data scientists, based on best practices at SAS and illustrated with some of our own “unicorns.” You can read my first blog post for why they are called unicorns and for tips 1 and 2 on finding them in an MS in Analytics program or from a great program you’ve never heard of. You can read tips 3 and 4 on how to find this kind of talent outside the traditional academic disciplines. Today’s post focuses on interns, social networks, and sponsorship for permanent residency.

5. Try before you buy – create an intern program

Golbarg Tutunchi (ABD* in Industrial Engineering/Operations Research, NC State University), Fatemeh Sayyady (PhD in Operations Research, NC State University), Zohreh Asgharzadeh (PhD in Operations Research, NC State University), and Shahrzad Azizzadeh (ABD* in Operations Research, NC State University).

Intern programs are a great way for both the employer and the student to find out if the candidate is a good fit at the organization, so SAS hires close to 200 students each year. In addition to the standard internship programs that we run, we have two special programs that are particularly useful in our quest to find analytical talent. The Graduate Research Assistant (GRA) program places PhD students from local universities into Advanced Analytics R&D for up to 20 hours/week during the year and full-time during the summer (we had 14 in 2014). These students work at SAS on research related to their graduate training, and we maintain a partnership with their academic advisors. SAS provides funding to the academic departments, which in turn provide a stipend, tuition, and benefits for the students. Academics benefit from exposure to problems in practice, SAS benefits from academic insight into our research areas, and the students receive not only funding but also exposure to practice. R&D also has technical student positions that function like regular internships. R&D hires most of these students when they finish, including the women above, who are part of the growing Iranian diaspora at SAS: Fatemeh (who works on SAS® Marketing Optimization), Zohreh (who leads a team of operations research specialists who develop solutions for retail and manufacturing), Shahrzad (who tests our mixed integer programming solver and decomposition algorithm in SAS/OR®), and Golbarg (a current GRA student who works on our Advanced Analytics and Optimization Services team, which provides specialized consulting on optimization and simulation problems).

Advanced Analytics R&D also has an Analytical Summer Fellows program, which hires PhD students for a special summer internship to expose them to the world of commercial statistical software research and development and provide an opportunity to explore software development as a career choice. In addition to a standard summer internship salary, the students receive a stipend, which allows us to recruit students from around the US. The program began with one student in statistics in 2006, and thanks to our CEO Jim Goodnight’s support and enthusiasm for programs of this kind, we have now expanded it to include positions for students in statistics, data mining, econometrics, operations research, and text analytics.

6. Use social networks to hire the friends of your unicorns

Left circle, Zohreh Asgharzadeh, and right circle, Shahrzad Azizzadeh, at Sangan Waterfall, northwest of Tehran, with classmates from their Sharif University of Technology Industrial Engineering class.

We know the power of social networks from an analytical perspective, but have you thought about using them in your recruiting? At SAS, 55% of our hires come from an employee referral. After all, while merit is critical in choosing the best candidate for a given role, having a “spotter” help you fill the pipeline can help immensely, since the pool of unicorns is small in the first place. And who better to know where to hunt for unicorns than a member of their own species? Our employees are eager to recruit people they know, because they consistently tell us they want to work with smart people.

Remember Zohreh and Shahrzad, two of our former GRA students? They have been close friends since high school in Iran. Zohreh first came to the US to pursue her PhD in Operations Research at NC State University. Shahrzad was still in Iran, first pursuing an MS in Industrial Engineering and then working in industry. Hearing about her friend’s positive experiences in graduate school in the US, Shahrzad decided to apply to a US graduate school herself. Zohreh worked as a GRA student at SAS on retail optimization solutions, and because she enjoyed her work, she opted to join SAS full-time upon graduation. Later, when Shahrzad saw an advertisement for a GRA placement at SAS, she asked her friend Zohreh about it. Zohreh encouraged her to apply, and the rest is history. And Fatemeh (from the photo above) applied for a position in part because she had a positive impression of SAS conveyed by her friends Shahrzad and Zohreh.

These two long-time friends have quite the academic pedigree: their math tutor in high school was Maryam Mirzakhani, who last year won the Fields Medal, widely considered the Nobel Prize of mathematics. As the first woman to win in its 78-year history, she is a trailblazer. Because they were encouraged to pursue their interest in math, these women defy the pattern in the Western world, where far fewer women choose the STEM disciplines. Curious about that difference, I found an interesting article that provides a cross-cultural analysis of students with exceptional math talent and offers some recommendations. Its conclusion is sobering: "In summary, some Eastern European and Asian countries frequently produce girls with profound ability in mathematical problem solving; most other countries, including the USA, do not." So while efforts are being made to close that gap, perhaps an additional tip should be: if you want to hire more female unicorns, seek out the Iranians!

7. Be willing to invest in sponsorship for foreign nationals

Bahadir Aral (PhD in Industrial and Systems Engineering from Texas A&M University), in front of the Süleymaniye Mosque in Istanbul.

"In today's global economy, innovation is the key to sustained growth and success. At SAS, we have long recognized this fact; it is why we are so committed to attracting and retaining the best and brightest minds from across the globe," wrote Jim Goodnight in this opinion piece on immigration reform. Part of his interest in immigration derives from the fact that economic forecasts project that American universities will produce 1 million fewer graduates from the science, technology, engineering and math (STEM) disciplines than the US workforce needs. And he points out that American universities are not currently able to meet that need with US-born students. In the US we are fortunate that top students from around the world are drawn to our excellent higher education system, so American companies can recruit top global talent domestically. An obvious repercussion is that some of these graduates will need legal sponsorship for the papers to stay in the country. As our CEO wrote, at SAS we feel this can be an investment with a high return: the fees are not significant, and the caliber of the available talent is high.

For example, Bahadir is a pricing expert and consultant in our Advanced Analytics and Optimization Services group in R&D. Even though he came to SAS with almost five years of professional experience, he still needed to complete paperwork for a green card to be able to work in the United States. When his work visa ran out he had to return to his native Turkey and work from our Istanbul office for 6.5 months while the next step in his immigration application was being processed. Bahadir is so dedicated that he maintained his customer meetings in spite of the seven hour time zone difference, which meant he was sometimes scheduled for telephone calls that approached the midnight hour in Turkey! Another pricing expert on his team, Natalia, is Russian by birth but as a child moved to Mexico, where she obtained her first PhD in industrial engineering at Tecnológico de Monterrey. She then moved to the US to pursue a second PhD, this time in Operations Research, from North Carolina State University. She arrived at SAS with dual Russian and Mexican citizenship but didn’t yet have permanent residency, so SAS is also sponsoring her application.

Natalia Viktorovna Summerville in front of St. Basil’s Cathedral on Red Square in Moscow.

Last three tips next time:

8. Hire the curious who want to solve problems
9. Think about what kind of data scientist you need
10. Don’t expect unicorns to grow horns overnight

What makes the smart grid “smart”? Analytics, of course!

You’ve heard about the smart grid, but what is it that makes the grid smart? I’ve been working on a project with Duke Energy and NC State University doing time-series analysis on data from Phasor Measurement Units (PMUs) that illustrates the intelligence in the grid as well as an interesting application of analytical techniques. I presented some of our findings at the North American SynchroPhasor Initiative (NASPI) workgroup meeting in Houston recently, so I thought I’d share them with a broader audience.

PMUs in the power grid

Phasor Measurement Units (PMUs) measure the power transmission grid at a much higher speed and fidelity than previous systems provided. PMUs measure the power frequency (i.e., 60 Hz), voltage, current, and phasor angle (i.e., where you are on the power sine wave). These units take readings at a rate of 30 measurements/second, while the previous systems took readings only every 3-4 seconds. This more frequent interval provides a much more detailed view of the power grid and allows detection of sub-second changes that were completely missed before.

Another great feature of PMUs is their very accurate time measurement. PMUs are installed at points along the power grid miles apart from each other; for example, Duke Energy has over 100 PMUs installed across the Carolinas. To analyze data and learn about the whole grid, we need to synchronize the measurements taken at these locations. PMUs have Global Positioning System (GPS) receivers built in, not to determine location but so that all units receive the same accurate time signal. Since GPS provides time accuracy in the nanosecond range, this is more than sufficient for measurements at 30/second. This accuracy is most critical in the measurement of phasor angles: by comparing the phasor angles between locations, we get a measure of the power flow between them. Since the measurements are of something oscillating at 60 Hz, the time stamp of each measurement must be significantly more precise than what is being measured.
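To illustrate why the synchronized time stamps matter, here is a minimal sketch (with made-up readings from two hypothetical PMUs) of comparing phasor angles in a wrap-around-safe way; once the samples share GPS-disciplined time stamps, the angle difference at each instant is a simple elementwise computation:

```python
import numpy as np

def angle_difference(theta_a, theta_b):
    """Smallest signed difference (degrees) between two synchronized
    phasor-angle streams, wrapped into [-180, 180)."""
    return (np.asarray(theta_a) - np.asarray(theta_b) + 180.0) % 360.0 - 180.0

# Two hypothetical PMUs sampled at the same GPS-disciplined instants:
a = np.array([10.0, 12.0, 179.0])
b = np.array([5.0, 10.0, -179.0])
delta = angle_difference(a, b)   # [5.0, 2.0, -2.0]
```

The wrapping step matters: a naive subtraction of 179° and -179° reports a 358° difference, when the phasors are in fact only 2° apart.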

Working with this data has highlighted the similarities and differences between working with big data and high-speed streaming data. Big data is typically a large amount of data that has been captured and stored for analysis. Streaming data arrives constantly at a high rate and must be analyzed as it is received. One of the many interesting things about this project is that it involves both big data and streaming data.

So what have we learned working with this data? The main purpose of this project is to detect and understand events that affect the power grid, with the objective of keeping the grid stable. We have learned that a number of time-series techniques are needed for the different aspects of the problem. The analysis flow breaks down into three areas: event detection (did something happen?), event identification (what happened?), and event quantification (how bad was it?).

For event detection, the task at hand is streaming data analysis. The system generates 30 measurements/second on hundreds of sensors and tags. Fortunately, the vast majority of the time (>99.99%) they indicate that no event of any kind is occurring. Since time-series patterns are present, they can be modeled and used to detect deviations from the normal pattern. Fitting these models lets us look forward with a very short-term forecast and then instantly detect an event of interest.
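A rough stand-in for this idea (not the models used in the actual project): forecast the next sample with exponential smoothing and flag any sample that deviates from the forecast by more than a threshold. The readings and threshold below are purely illustrative.

```python
import numpy as np

def detect_events(stream, alpha=0.1, threshold=5.0):
    """Flag the indices of samples whose deviation from a one-step-ahead
    exponentially smoothed forecast exceeds `threshold`."""
    forecast = stream[0]
    events = []
    for i, x in enumerate(stream[1:], start=1):
        if abs(x - forecast) > threshold:
            events.append(i)
        forecast = alpha * x + (1 - alpha) * forecast  # update after checking
    return events

# Steady 60 Hz frequency readings with one sudden excursion:
readings = np.array([60.0] * 50 + [52.0] + [60.0] * 10)
detect_events(readings)   # [50]
```

Because the forecast adapts slowly (small alpha), a single-sample excursion is flagged immediately while the normal pattern resumes without false alarms.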

Event identification is the next order of business. An event of interest doesn’t necessarily mean there is a problem or that one will develop. Some events are random, like a lightning strike or a tree hitting a power line. Others represent some type of equipment failure. We’ve found that many of these events produce a similar ‘signature’ in the data stream, and time-series similarity analysis and time-series clustering have been able to match incoming events to previously seen ones. Knowing which previous event signatures are non-consequential allows us to safely ignore them.
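A much-simplified sketch of the matching step: compare an incoming event's signature against a small library of known signatures and return the nearest one by Euclidean distance. (A real system would use more robust similarity measures, and the labels and signatures below are invented for illustration.)

```python
import numpy as np

def identify_event(signature, library):
    """Return the label of the library signature nearest to `signature`
    by Euclidean distance; a toy stand-in for time-series similarity
    analysis. `library` maps label -> reference signature."""
    distances = {label: np.linalg.norm(np.asarray(signature) - np.asarray(ref))
                 for label, ref in library.items()}
    return min(distances, key=distances.get)

# Hypothetical normalized event signatures:
library = {
    "lightning_strike": [0.0, 5.0, -4.0, 1.0, 0.0],
    "breaker_trip":     [0.0, 2.0,  2.0, 2.0, 0.0],
}
identify_event([0.1, 4.5, -3.8, 0.9, 0.0], library)   # "lightning_strike"
```

If the nearest known signature is one previously judged non-consequential, the event can be logged and ignored automatically.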

Finally we look at event quantification. For some events, the question is not just whether the event is occurring but also whether its magnitude gives cause for concern. An example is oscillation on the power grid. Small, diminishing oscillations are not necessarily a problem, but larger ones that are growing may require further attention. Once the event type is identified, specialized techniques determine its magnitude and consequence.
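One simple way to decide whether an oscillation is growing or diminishing is to fit a line to the logarithm of its envelope's peak amplitudes; the sign of the slope gives the trend. This is a rough sketch on synthetic signals, not the project's actual method:

```python
import numpy as np

def oscillation_trend(signal):
    """Sign of the growth rate of an oscillation's envelope:
    negative = diminishing, positive = growing. Fits a line to the
    log of the absolute peak amplitudes."""
    x = np.abs(signal)
    peaks = [i for i in range(1, len(x) - 1) if x[i] > x[i - 1] and x[i] > x[i + 1]]
    amps = np.log(x[peaks])
    slope = np.polyfit(peaks, amps, 1)[0]   # per-sample log-amplitude trend
    return slope

# Synthetic 3 Hz oscillations with decaying and growing envelopes:
t = np.linspace(0, 10, 2000)
damped = np.exp(-0.3 * t) * np.sin(2 * np.pi * 3 * t)
growing = np.exp(0.2 * t) * np.sin(2 * np.pi * 3 * t)
```

Here `oscillation_trend(damped)` is negative and `oscillation_trend(growing)` is positive, separating the benign case from the one that warrants attention.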

This project has provided interesting insights into how to make the power grid smarter. Many of these techniques are also beneficial to streaming data analysis seen in other industries and applications. If there is a need to automatically identify and categorize system events based on data patterns, or filter out events that are non-consequential, then these techniques will be helpful.

Photo credits

PMUs in the power grid: Synchrophasor Technologies and their Deployment in the Recovery Act Smart Grid Programs, August 2013 report by the US Department of Energy

Firefighter image credit: photo by US Navy // attribution by creative commons

Hoover Dam image credit: photo by IAmSanjeevan // attribution by creative commons

Are there jobs for economists in analytics?

SAS will again be participating in the Allied Social Science Association annual meetings in January. This year the event will be held in Boston, and conference organizers expect more than 12,000 participants from a variety of backgrounds, including economics, finance and many other social sciences. One of the primary functions of the event, aside from traditional academic sessions, is that it serves as a single meeting place for employers and job candidates. Each year, approximately 1,000 candidates attend ASSA for the sole purpose of finding a job. As I’ve written before, corporate economists are hot again and a great source of analytical talent, so if you’re on the job market, consider exploring a career in industry; it’s a great place for economists to look.

I’ve worked in academia and industry, so I know both worlds. And in my current role I constantly talk with economists performing analytical functions in some of the world’s largest companies. What I have seen is that both worlds provide ample opportunity to use your skills as an economist. In fact, I would argue that a corporate analytic role will challenge your data skills in a way academia does not. You will be pushed to learn new methods, and as an economist you will be uniquely positioned to explain results to colleagues. There is, of course, no ‘free lunch,’ so your research will be guided by the firm’s revenue maximization or cost minimization priorities.

As you may know, SAS is one of the analytical computing platforms preferred by both business and government. In fact, Monster.com ranked it #1 on their list of “Job Skills that Lead to Bigger Paychecks.” For these reasons I was curious to conduct a little research about jobs using SAS. I chose to search the JOE website (Job Openings for Economists), the primary listing source for jobs in economics. I limited my search to positions listed as “Full-time nonacademic,” as universities are not likely to prefer one computing language over another. Of the 247 nonacademic listings, here is what I found for each search term.

Search term    Number of jobs
SAS            41
Stata          34
Matlab         27
R              16
Python         11
SQL*            9

*While SQL is not a computational program, I would consider it a language.

SAS was explicitly listed in 41 job descriptions. Each number above is a hyperlink to the actual search, should you be interested in seeing the jobs available.

For those attending the ASSA conference, please stop by the SAS booth in the exhibit hall and say hello. And if you want to talk about the market for economists in industry or would like to meet some of this year’s job market candidates or industry representatives, please consider attending the SAS Academic and Industry Reception to be held on Sunday, January 4th at 5pm. If you are not attending ASSA but will be in the area on that date, we’d be glad to have you join us at this reception. Please click here to RSVP.

Don’t be fooled: there are really only two basic types of distributed processing

Every time I pick up a new article about analytics, I am always disappointed that I cannot find any specifics about back-end processing. It is no secret that every vendor wishes they had the latest and greatest parallel processing capabilities, but the truth is that many software vendors are still bound by single-threaded processing – as indicated by their obvious reticence to discuss details on the subject. As a result of using older approaches to data processing, most competitors toss around terms like ‘in-memory’ and ‘distributed processing’ to sow confusion about how their stuff actually works. I will explain the difference in this post and tell you why you should care.

The truth is that there are really only two basic types of distributed processing: multi-processing (essentially grid-enabled networks) and pooled-memory massively parallel processing (MPP). Multi-processing essentially consists of duplicate sets of instructions being sent to an array of interconnected processing nodes. In this configuration, each node has its own allocation of CPU, RAM, and data, and generally cannot communicate or share information with other nodes in the same array. While a large multi-step job can be chopped into pieces and each piece processed in parallel, the multi-processing configuration is still largely limited by the duplicate, single-threaded sets of instructions that need to run.

Contrast multi-processing with a pooled-memory architecture that has inter-node communication and does not require duplicate sets of instructions. Each node in a pooled-resource configuration can work on a different part of a problem, large or small. If any node needs more resources, data, or information from any of the other nodes, it can get what it needs by issuing messages to any of the other nodes. This makes for a truly ‘shared resources environment,’ and as a consequence it runs about ten times faster than the fastest multi-processing array configuration.

Now, much of the confusion about these two types of distributed processing exists because of misuse of the term ‘in-memory’. The fact is that ALL data processing occurs in-memory at some point in the execution of a set of code instructions, so ‘in-memory’ by itself tells you little about distributed processing. For example, traditional SAS processing has always occurred in-memory as blocks of data are read from disk into RAM. As RAM allocations have grown, more data has been loaded into memory, yet the instructions were still processed in a single-threaded, sequential fashion. What was needed was a rewrite of the software to enable multi-threading, namely routing separate tasks to different processors. Combining a multi-threaded program with all data pre-loaded into memory produces phenomenally fast run times compared with what could be accomplished before.
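To make the distinction concrete, here is a generic Python sketch (an illustration of the idea, not a description of SAS internals) of routing separate chunks of an in-memory task to different workers:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# All of the data pre-loaded into memory:
data = np.arange(10_000_000, dtype=np.int64)

def chunk_sum(chunk):
    # NumPy releases the GIL inside sum(), so the chunks can overlap in time.
    return int(chunk.sum())

# Multi-threaded: route separate chunks of the task to different workers,
# then combine the partial results.
chunks = np.array_split(data, 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(chunk_sum, chunks))
# total matches a single-threaded data.sum(); only the routing changed.
```

Note that in pure Python the global interpreter lock limits CPU parallelism for threads; the sketch shows the task-routing pattern, while production systems rely on native threads or processes to get true concurrent execution.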

Even though a program is multi-threaded, there is still no guarantee that things will run faster. An obvious example is Mahout, an Apache project that relied on MapReduce to facilitate inter-node communication in a pooled-resource environment. MapReduce is notoriously slow, as nodes take a long time to load data into memory and must write inter-node communication requests to disk before other nodes can access the request. As a consequence of its lethargic response time, Mahout has largely been abandoned by most large business customers in favor of faster architectures.

Message Passing Interface (MPI) is a much faster communication protocol, because it runs in-memory and can accomplish the multiple data iterations that are common in predictive analytics work. Currently there are only two MPI initiatives that offer true multi-threading: one based on Spark, an in-memory plug-in to Hadoop, and SAS’ High Performance Analytics. Spark development is still in its infancy, comparatively speaking, and it will likely be years before any push-button applications can make use of its capabilities. Alternatively, SAS has products that are production-ready today and can dramatically shorten your analytics lifecycle. So do not be fooled by claims of in-memory or distributed processing: MPI-enabled pooled-memory processing is here to stay and is well positioned to become the de facto standard for future predictive analytics processing.

For many standard analytics jobs, your standard architecture may be sufficient. But these phenomenally fast run times matter when you are trying to process dozens, if not hundreds, of tournaments of the most advanced machine learning techniques, like random forests and deep-learning neural networks. Statistical professionals are finding that these new techniques are not only more accurate, but they also allow us to investigate much lower levels of granularity than ever before. As a result, models are getting more precise and profitability is increasing concomitantly. So if you want to solve more problems faster and with more accuracy (plus use the same headcount), be sure to investigate claims of “in-memory” and choose the right architecture for your job.

Econometric reflections from Analytics 2014

This post will violate the “what happens in Vegas stays in Vegas” rule, because last week I had the pleasure of attending and participating in the Analytics 2014 event there and want to share some of what I heard for those who couldn’t attend. I was joined by over 1,000 attendees and colleagues as we gathered to share best practices in the fields of statistics, econometrics, forecasting, text analytics, optimization, data mining, and more. My talk, demo theatre presentation, and exhibit hall hours gave me the opportunity to meet many interesting people, so here are some of my thoughts based on what I heard and learned.

  • Jan Chvosta and I presented, “Why Econometrics Should Be in Your Analytics Toolkit: Applications of Causal Inference” (presentation available here) to approximately 100 attendees and sincerely appreciated the feedback we received. Of note to me was just how many audience members approached us afterward and said that “causal interpretation” is what they strive for with their predictive modeling. From marketing mix models to CCAR stress testing to price elasticity estimates, I saw many nodding heads when we talked about the importance of interpretation in these models. To twist the words of Nobel Laureate Robert Lucas, “once you start thinking about causality, it is hard to think about anything else.” It appears to me that there are still many people interested in the meaning of models in this world of “big data” and “machine learning.”
  • I was able to attend Michele Trovero’s talk on using SAS/ETS® tools to estimate linear and non-linear state-space models. Michele showed several examples and benefits of the existing SSM procedure and gave a look ahead to the upcoming ESMX procedure. Once released, this procedure will allow new exponential smoothing models to be estimated in SAS and will provide a statistical treatment of structural time series models based on exponential smoothing. Use this link to find his slides or listen to this interview to hear Michele describe some of these tools.
  • At the econometrics booth I was asked about many different topics. We talked about state-space models (PROC SSM), time series data management with PROC TIMEDATA and the use of multinomial discrete choice models for price elasticity measurement (PROC MDC). However, far and away the dominating topic was the new CCAR compliance regulations. For those unfamiliar with these regulations, it is a requirement that bank stress tests incorporate macroeconomic scenarios into the forecasts of their financial performance, with the goal of ensuring bank solvency under adverse economic events. One of the difficulties of this problem is that the objective of the analysis is no longer strictly a predictive modeling problem. New policies dictate that certain variables must be included and that these effects should behave consistently with economic theory. No fewer than ten banks independently asked about modeling techniques to satisfy these regulations. It was quite interesting, because each bank had a different method of solving the problem. Some banks with access to micro data chose to model probabilities of default for each asset. Other banks without access to this data have opted for a time-series based approach. For some time now, I have been working on formalizing these methods, and I will present a paper on the subject both at the Conference on Statistical Practice in February and at SAS Global Forum in April.
  • It was my pleasure to lead a roundtable discussion about statistical and econometric modeling in health insurance. It was a packed table and I apologize if you were turned away. Many of the topics discussed during our causality talk were echoed during the roundtable, most notably non-random assignment of certain interventions. In fact, one large health insurer spoke about the early returns of the Affordable Care Act with respect to substitution effects, or lack thereof, from emergency room usage to traditional clinics. This may suggest that the population now covered by new rules remains unlikely to shift their healthcare usage from high-cost emergency room visits to less costly outpatient facilities.
  • Finally, I would like to thank the 15 or so attendees at my 7:30am demo theater presentation on the new items in SAS/ETS®. These were brave and dedicated souls to be such early risers in Vegas! I always enjoy the chance to evangelize our new tools. We spent a great deal of time talking about PROC HPCDM, a tool for simulating aggregate loss distributions in insurance and banking. People were interested because there is now a very computationally efficient way to simulate aggregate losses subject to business rules about deductibles and limits. We also talked about new methods of estimating limited dependent variable models in the presence of endogenous regressors. There was an interesting question about tools for spatial econometric models, which aren't part of the current portfolio but will definitely be part of future presentations.
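As a toy illustration of the time-series flavor of the CCAR approach mentioned above, one can regress a historical loss rate on macroeconomic drivers and then project it under a stressed scenario. This is a minimal sketch in Python; all variable names, coefficients, and scenario values are hypothetical, not taken from any bank's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical history: quarterly loss rate driven by unemployment and GDP growth.
n = 40
unemployment = rng.uniform(4, 10, n)           # percent
gdp_growth = rng.uniform(-2, 4, n)             # percent
loss_rate = 0.5 + 0.3 * unemployment - 0.2 * gdp_growth + rng.normal(0, 0.1, n)

# Fit loss_rate ~ unemployment + gdp_growth by ordinary least squares.
X = np.column_stack([np.ones(n), unemployment, gdp_growth])
beta, *_ = np.linalg.lstsq(X, loss_rate, rcond=None)

# Project the loss rate under a severely adverse scenario (made-up values).
stress = np.array([1.0, 10.0, -3.5])  # intercept term, unemployment, GDP growth
projected_loss_rate = stress @ beta
```

A real CCAR model would of course involve many more variables, dynamics, and sign constraints dictated by economic theory, which is exactly what makes the problem more than a pure prediction exercise.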
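To give a flavor of what the aggregate loss simulation described in the last bullet involves, here is a generic Monte Carlo compound-distribution sketch in Python. It is not how PROC HPCDM is implemented or invoked, and all distributions and parameters below are made up for illustration.

```python
import numpy as np

def simulate_aggregate_losses(n_sims=10000, freq_mean=5.0,
                              sev_mu=8.0, sev_sigma=1.5,
                              deductible=1000.0, limit=50000.0, seed=7):
    """Compound-distribution Monte Carlo: claim counts are Poisson,
    severities lognormal; each claim is reduced by a deductible and
    capped at a limit before summing to the period's aggregate loss."""
    rng = np.random.default_rng(seed)
    totals = np.empty(n_sims)
    for i in range(n_sims):
        n_claims = rng.poisson(freq_mean)
        severities = rng.lognormal(sev_mu, sev_sigma, n_claims)
        paid = np.clip(severities - deductible, 0.0, limit)  # apply business rules
        totals[i] = paid.sum()
    return totals
```

From the simulated totals one can then read off tail percentiles (for example, `np.percentile(totals, 99.5)`) for capital or reserving purposes.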

Ken Sanford being interviewed on the demo floor at Analytics 2014 (click on the photo for the interview)

Unicorn hunting: finding data scientists outside traditional academic disciplines

Finding people with the range of skills classified as data science can be a challenge, which is why some call them unicorns (do they really exist?). That's why I recently posted ten tips on finding unicorns. In my first post I elaborated on tips 1 and 2 (1. hire from an MS in Analytics program and 2. hire from a great program you've never heard of). In this post I'll share two more tips, which entail hunting for data scientists beyond the math, stats, computer science, operations research, and engineering departments where you might most expect to find this kind of talent.

3. Recruit from untraditional disciplines

Wayne Thompson, PhD in Plant Sciences from the University of Tennessee, during his year spent as a visiting scientist at the Institut superieur d'agriculture de Lille in France.

As this article from Inc. points out, computer science may not be the best place to find data scientists. In fact, the article refers to a survey of data scientists, 51% of whom say the best source of data scientists is outside of computer science. For that matter, if you limit yourself to other, perhaps more “traditional,” analytical disciplines, you may be overlooking some great candidates, like Wayne Thompson, the Chief Data Scientist at SAS, who studied plant sciences but minored in statistics. His path through agricultural sciences is natural for many of our colleagues, since SAS was founded by a consortium of land-grant universities heavily funded by grants from the United States Department of Agriculture. Over the years many of our senior executives have had degrees in agriculture-related disciplines such as forestry and agricultural economics.

Juthika Khargaria, Ph.D. in Astrophysical and Planetary Sciences from University of Colorado, standing next to the 18” telescope at Sommers Bausch Observatory.

Consider Juthika, an analytics solutions architect who assists customers in defining their business problems and demonstrates how SAS advanced analytics solutions can help. Before joining SAS, Juthika studied astrophysics, working with complex systems, statistics, and abstract concepts. Juthika says that the data astrophysicists deal with has high noise but low signal, so they are experienced in methods to tease out that signal. See how well Juthika bridges that gap in this blog post she wrote on using wavelets to separate the signal from the noise. Physicists also usually have strong computational skills, which is why we have hired several in Advanced Analytics R&D to develop our software.

4. Look beyond STEM departments to recruit from the social sciences

There are plenty of good reasons to recruit from the STEM (science, technology, engineering, and math) disciplines, since these fields provide their students an excellent foundation for analytical problem-solving. But there are good reasons to look to the social sciences as well. Phil Weiss is an analytical consultant who helps customers understand how our advanced analytics software might solve their business problems. He shared, “The value of a liberal arts degree cannot be understated when it comes to being able to more easily handle difficult conceptual problems and the multifarious nature of symbolic systems, especially programming languages….My statistics training and close association with ‘big data’ derived from depositional patterns allowed me to transition into computer science even though I had limited training in any STEM field.” In fact, as this article from Fast Company shows, many tech CEOs even prefer to hire from the liberal arts, arguing that these disciplines train students to “thrive in ambiguity and subjectivity,” which are hallmarks of any real business environment.

Phil Weiss (on left), ABD in Archaeology, Arizona State University on a dig while in grad school.

Most PhD programs in the social sciences require their students to take courses in the quantitative methods necessary to do data-driven research, so they may even have a more substantial foundation than you’d expect. The School of Social Welfare at the University of Wisconsin-Milwaukee even offers a Graduate Certificate in Applied Data Analysis Using SAS.  Significant research on statistical theory and quantitative methods is being done in colleges of education. There is an emerging field of computational journalism. And my colleague Ken Sanford has written and spoken extensively about why economists make great data scientists. So wander across campus to different buildings on your recruiting trips, and you may be delighted and surprised at what you find.

Next time:

5. Try before you buy - create an intern program

6. Sponsor foreign nationals


Missing unicorns - 10 tips on finding data scientists (Part 1)

As this article on the mythical data scientist describes, many people call this special kind of analytical talent "unicorns," because the breed can be so hard to find. In order to close the analytical talent gap that McKinsey Global Institute and others have predicted, and many of you experience today, SAS launched SAS Analytics U in March of this year to feed the pipeline of analytical talent. This higher-education initiative aims to help address the skills gap by offering free versions of SAS software, university partnerships, and more. Yes, I did say free, and the free SAS® University Edition even runs on Mac, in addition to PC and Linux! Meanwhile, since data scientists can be hard to find, I'll share with you ten tips to use in your hunt, illustrated with examples from some of our own legendary unicorns at SAS.

Five of the tips relate to academic recruiting:

1.  Hire from an MS in Analytics program
2.  Hire from a great program you’ve never heard of
3.  Recruit from untraditional disciplines
4.  Look beyond STEM - recruit from social sciences
5.  Try before you buy – create an intern program


Five more relate to other best practices:

6.  Invest in sponsorship for foreign nationals
7.  Use social networks to hire friends of unicorns
8.  Hire the curious who want to solve problems
9.  Think about what kind of data scientist you need
10.  Don’t expect unicorns to grow horns overnight


Each tip is worth expansion, so I'll share two in this post and more in subsequent posts.

Patrick Hall, MS in Analytics, NC State University

1. Hire from an MS in Analytics program

SAS proudly helped launch and continues to support the Institute for Advanced Analytics at NC State University, led by Dr. Michael Rappa. It is the granddaddy of them all, and with good reason: over 90% of their graduates get offers by graduation, because in their intensive 10-month program they receive not only an outstanding academic foundation but also targeted attention to those “softer” skills, like public speaking, teamwork, and business problem identification and formulation, that are so essential to the practice of analytics. Patrick Hall, pictured here while getting his MS in Analytics from this program, is one of the machine learning experts on the SAS® Enterprise Miner™ R&D team and even a certified data scientist, one of the few to pass the rigorous Cloudera Certified Professional: Data Scientist (CCP:DS) exam. SAS works with scores of these programs, whose numbers are exploding, and they can be great places to recruit graduates with training in analytics and experience using SAS software.


Dr. C (far left), Murali Pagolu (fifth from left) and Satish Garla (sixth from left), both MS in Management of Information Systems/Analytics, Oklahoma State University

2. Hire from a great program you’ve never heard of

In addition to the many well-known programs, there are some great ones that you might not have heard of, like the one at Oklahoma State University (OSU) run by Dr. Goutam Chakraborty (just call him Dr. C.), who has graduated 700+ unicorns in the last decade. Designed to recognize students with advanced knowledge of SAS, these joint certificate programs, supported by the SAS Global Academic Program, require students to complete a minimum number of credit hours in relevant courses. Murali Pagolu and Satish Garla both received an MS in Management Information Systems/Analytics from this program and are pictured here when they were winners in the 2011 SAS Analytics Shootout, held annually at our Analytics Conference. Murali and Satish work in our Professional Services Division, helping customers implement SAS software and get their analytical models in place. They are just two of the many OSU graduates who have won countless awards. An executive at a large Midwestern manufacturer recently told me that he had to persuade his Human Resources Department to send a recruiting team to Stillwater, Oklahoma, but it paid off: they found two of their own unicorns there. Or convince HR to pay a visit to Kennesaw, Georgia, to see Dr. Jennifer Priestley's program, run out of the statistics department at Kennesaw State University, which was recently cited by Computerworld as having the most innovative academic program in Big Data Analytics. There are many more programs like these around the country where you can recruit, so don't limit yourself to universities with which you are familiar.

I'll explain more tips and show more unicorns in future posts, but if you're attending the Analytics 2014 conference in Las Vegas October 20-21, there will be a virtual herd of SAS unicorns galloping around! I'll be giving a demo theater presentation on analytical talent where I'll give all ten of my tips for finding them. Stop me and say hi if you're there; I always like meeting unicorns and could introduce you to many others we have there. Many of those mentioned in this post are on the great conference agenda and will be presenting:

  • Dr. Michael Rappa, who leads the Institute for Advanced Analytics at NC State University, will give a keynote session on "Solving the Analytics Talent Gap."
  • Patrick Hall, SAS unicorn and one of Rappa's former students, will give a presentation on "An Overview of Machine Learning with SAS® Enterprise Miner™" and a super demo on "R integration in SAS® Enterprise Miner™."
  • Murali Pagolu, SAS unicorn and OSU graduate, will present with his former professor, Dr. C, on "Unstructured Data Analysis: Real World Applications and Business Case Studies." Dr. C will bring 35 of his current and former students to the conference and has two teams that are finalists and another that earned an Honorable Mention in the 2014 Analytics Shootout.
  • Dr. Jennifer Priestley of Kennesaw State University will talk about "What You Don't Know About Education and Data Science."
  • Stop by the Networking Hall to visit booths on SAS Analytics U, the Global Academic Program, and programs from NCSU, OSU, and Kennesaw State University, as well as many other academic sponsors who run great programs you should add to your recruiting list.


(Unicorn poster image credit: photo by Arvind Grover, used under a Creative Commons attribution license. Other photos courtesy of the unicorn pictured.)

How discrete-event simulation can help project prison populations

In 2011, the passage of the federal Justice Reinvestment Act (JRA) brought significant changes to North Carolina's criminal sentencing practices, particularly in relation to the supervision of offenders released into the community on probation or post-release supervision. A recent New York Times article highlighted how NC has used the JRA to implement cost-saving strategies. Each year the NC Sentencing Commission prepares prison projections that are used by the state Department of Public Safety and the NC General Assembly to help determine correctional resource needs for adult offenders, but the changes resulting from the JRA put a huge kink in the long-established process used to generate those projections. Discrete-event simulation software from SAS helped smooth out the kinks.

The NC Sentencing Commission had been using a simulation model written in C-based code to project NC’s prison population for more than twenty years. The changes imposed by the JRA required new functionality not available in the existing simulation. The Administrative Office of the Courts (AOC) ultimately contracted with SAS to develop a more flexible and transparent prison population projection model using discrete-event simulation.

Traditional time series methods are ineffective for prison population projections because of dynamic factors like sentence length, prior criminal history, revocations of community supervision, and legislative changes. As an alternative, the SAS Advanced Analytics and Optimization Services Group (AAOS) used SAS® Simulation Studio to build a discrete-event simulation model that approximates the journeys of offenders through the criminal justice system. In general, discrete-event simulation is used to model systems where the state of the model is dynamic and changes in the state (called events) occur only at countable, distinct points in time. Examples of events in the prison model include an arrival of an offender in prison or a probation violation.
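To make the idea of discrete-event simulation concrete, here is a minimal event-loop sketch in Python. This is not the actual Simulation Studio model; the arrival pattern and sentence-length distribution are invented for illustration. The key property it demonstrates is that the system state (the prison population) changes only at distinct event times.

```python
import heapq
import random

def simulate_prison(n_offenders=1000, horizon_months=120, seed=42):
    """Minimal discrete-event loop: the state (prison population) changes
    only at distinct event times (arrivals and releases). Rates and
    sentence lengths here are hypothetical, for illustration only."""
    rng = random.Random(seed)
    events = []  # min-heap of (time, event_type)
    for _ in range(n_offenders):
        arrival = rng.uniform(0, horizon_months)
        sentence = rng.expovariate(1 / 18.0)  # mean 18-month term (made up)
        heapq.heappush(events, (arrival, "arrival"))
        heapq.heappush(events, (arrival + sentence, "release"))

    population = peak = 0
    while events:
        time, kind = heapq.heappop(events)
        if time > horizon_months:
            break  # past the projection horizon
        population += 1 if kind == "arrival" else -1
        peak = max(peak, population)
    return population, peak
```

A tool like Simulation Studio wraps this same event-queue mechanism in a graphical modeling interface, with blocks for entity generation, routing, and statistics collection.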

Process flowcharts provided the framework for the Simulation Studio prison projection model (for those interested in more detail, these flowcharts can be found in a more extended paper on this model presented at SAS Global Forum in 2013). Even though most of the JRA provisions went into effect on December 1, 2011, there is a period of time during which portions of the JRA do not apply to certain offenders. As a result, the new simulation model incorporates both pre-existing and new legislative policies.

The AAOS Group in R&D translated the logic contained in the flowcharts into a Simulation Studio model. The entities (or objects) that flow through the model represent criminal cases. They have attributes (or properties) such as case number, gender, race, age, and prison term. At simulation execution, case entities are generated and routed according to their attributes over a ten-year period. For example, when it is time for a case entity to be released from prison, a check is done to see if that entity qualifies for post-release supervision. If so, then the entity is routed to logic that samples a random number stream to determine whether or not that entity will commit a violation at some point in the future.
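In rough pseudocode form, that release-time routing might look like the Python sketch below. The attribute names, offense classes, and probabilities are hypothetical placeholders, not values from the actual model.

```python
import random

def route_on_release(case, rng):
    """Sketch of release-time routing for a case entity: check whether the
    entity qualifies for post-release supervision, and if so sample a random
    number stream to decide whether a future violation occurs. All attribute
    names, offense classes, and probabilities are hypothetical."""
    if case["offense_class"] in {"B1", "C", "D", "E"}:
        case["post_release_supervision"] = True
        if rng.random() < 0.30:  # hypothetical violation probability
            # Sampled lag time (months) between release and return to prison.
            case["violation_month"] = case["release_month"] + rng.uniform(1, 24)
    else:
        case["post_release_supervision"] = False
    return case
```

In the real model this logic is expressed with Simulation Studio routing blocks rather than hand-written code, which is part of what makes the model transparent and easy for the Sentencing Commission to modify.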

The inputs to the Simulation Studio model are in the form of SAS data sets and include the following:

  1. Stock data, provided by the Department of Public Safety’s Division of Adult Correction: includes inmates in prison at the beginning of the projection period and their projected release date.
  2. Court data, provided by the AOC: contains convictions and sentence imposed in the most recent fiscal year.
  3. Growth estimates: projected growth rate for convictions as determined by the Sentencing Commission’s forecasting advisory group after examining demographic trends, crime trends, and arrest trends.

The court and stock data include both individual-level information (such as demographics, offense, and sentence) as well as aggregate-level information (such as the probability of receiving an active sentence by offense class and the lag-time between placement on probation and a return to prison for a violation).

At the end of the simulation, two SAS data sets are generated, providing a complete history of prison admissions and releases over a ten-year time period. From this data, monthly and annual projections can be prepared at an aggregate level as well by variables of interest such as gender, race, age, and offense class.

After the AAOS group finished building the simulation model, it was handed over to the Sentencing Commission, along with documentation and training for the Simulation Studio modeling interface, so that the Sentencing Commission could then run the model and make changes as needed. They have used the model to prepare projections for two years now, with the first official results being published in February of 2013. Figure 1 shows the projected prison population and capacity for FY 2014 through FY 2023. The prison population is projected to increase from 37,679 in June 2014 to 38,812 in June 2023, an increase of 3%. A comparison of the projections with the operating capacity indicates that the projected prison population will be below prison capacity for the ten-year projection period. In June 2014 the actual average prison population was 37,731, so the model-projected population of 37,679 was within 0.2% of the actual. The current projections, as well as projections from previous years, are located on the Sentencing Commission’s website.

This project demonstrates a very promising application of discrete-event simulation in practice. The resulting Simulation Studio model not only incorporates changes to correctional policies as a result of the JRA, but it can easily be modified by the Sentencing Commission to incorporate any future legislative acts that affect the prison process, providing the state the flexibility and transparency they desired. The model can also be extended to project other criminal justice populations (such as juveniles) in both NC and in other states.

Building a $1 billion machine learning model

At the KDD conference this week I heard a great invited presentation called How to Create a $1 billion Model in 20 days: Predictive Modeling in the Real World – A Sprint Case Study. It was presented by Tracey de Poalo from Sprint and former Kaggle President and well-known machine learning expert Jeremy Howard (@jeremyphoward). Jeremy convinced Sprint's CEO that machine learning could help their business, so he was brought on as a consultant to work with Tracey and her team. The result was the $1 billion model, which he called the highest-value machine learning case he's ever seen.

Jeremy had the executive blessing they needed to get access to key teams, so they conducted 40-50 interviews to identify which business problems to prioritize for their work. Based on these interviews they decided to prototype models for churn, application credit, behavioral credit, and cross-sell. When ready to tackle the data, Jeremy was impressed that they were ahead of the curve. Tracey’s team had already built a data mart of 10,000 features on each customer. Jeremy said their thorough and well-organized data dictionary was the best he’d seen in his career.

For a planned benchmarking exercise, Jeremy chose his favorite Kaggle-winning scripts built on the R packages caret and randomForest. Based on his past Kaggle success, he felt confident he'd beat her existing models. When the results were in, he confessed he was shocked that his were almost the same as hers, which were based on logistic regression. Kudos to Jeremy for his refreshing honesty, as someone commented during the Q&A.

Tracey’s team’s process was rigorous and completely automated: 1) missing value imputation; 2) outlier treatment; 3) variable reduction (cutting the variables by ~65%); 4) transformations; 5) VIF (limited to 10); 6) stepwise regression (down to ~1,000 variables); 7) model refitting (50-75 variables left). Jeremy was most amazed at Tracey's strategic use of variable clustering, commenting that it is an interesting approach that he hadn't seen elsewhere. She ranked her variables by R2 and then picked one variable per cluster.
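For readers curious what variable clustering looks like in practice, here is a simplified Python sketch of the idea: group highly correlated predictors, then keep the member of each group with the best univariate R-squared against the target. This greedy, correlation-threshold version is a stand-in for illustration, not Tracey's actual pipeline or the PROC VARCLUS algorithm.

```python
import numpy as np

def select_by_variable_clustering(X, y, threshold=0.7):
    """Greedy illustration of variable clustering: group variables whose
    absolute pairwise correlation exceeds a threshold, then keep the member
    of each group with the highest univariate R-squared against the target."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed_var = unassigned.pop(0)
        group = [seed_var] + [j for j in unassigned if corr[seed_var, j] > threshold]
        unassigned = [j for j in unassigned if j not in group]
        clusters.append(group)
    selected = []
    for group in clusters:
        # Univariate R^2 of each member against y is its squared correlation.
        r2 = [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in group]
        selected.append(group[int(np.argmax(r2))])
    return selected
```

Picking one representative per cluster keeps the model interpretable and sidesteps the multicollinearity that thousands of correlated customer features would otherwise cause.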

As a result of their work together, their new model identified nine variables that explained the majority of bad debt. Combining these factors with customer credit data, they were able to estimate customer lifetime value, which allowed them to quantify the cost of making a bad call on credit. Adding these costs up, you reach $1 billion in value.

A history of machine learning in SAS

What I love about the machine learning model Tracey's team had in place is that it has its roots in a very early SAS procedure, VARCLUS, which goes back to at least the early 1980’s. As I wrote before, machine learning is not new territory for SAS. SAS implemented a k-means clustering algorithm in 1982 with PROC FASTCLUS in SAS/STAT® (as described in this paper), but after reading my post Warren Sarle pointed out that PROC DISCRIM did k-nearest-neighbor discriminant analysis at least as far back as SAS 79. This early procedure was written by a certain J. H. Goodnight, whom some may recognize as SAS founder and CEO.


A neural learning technique called the perceptron algorithm was developed as far back as 1958. But neural network research made slow progress until the early 1990’s, when the intersection of computer science and statistics reignited the popularity of these ideas. In Warren Sarle’s 1994 paper Neural Networks and Statistical Models (where I found the illustration to the left), he even says that “the most commonly used artificial neural networks, called multilayer perceptrons, are nothing more than non-linear regression and discriminant models that can be implemented with standard statistical software.” He then explains that he will translate “neural network jargon into statistical jargon.”
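For the curious, the 1958 perceptron learning rule itself fits in a few lines. The Python sketch below learns a linearly separable rule (logical AND), underscoring Sarle's point that these models are close cousins of standard regression and discriminant methods.

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=1.0):
    """Rosenblatt's perceptron learning rule. Labels y are in {-1, +1};
    for each misclassified point, the weights are nudged toward (or away
    from) that point until the data are separated or epochs run out."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # converged: every point is classified correctly
            break
    return w, b

def predict(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)

# Learn logical AND, a linearly separable function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
```

A multilayer perceptron stacks such units with nonlinear activations, which is exactly why Sarle could describe it as nonlinear regression in different jargon.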

Flash forward to today, where this article from Forbes reports that the most popular course at Stanford is one on machine learning. It is popular once again, and the discussions and papers at KDD this week certainly reflected this trend. While machine learning is nothing new for SAS, there is a lot of new machine learning in SAS. You can read more on machine learning in SAS® Enterprise Miner in this paper and in SAS® Text Miner in this paper, to name just a few of our products with machine learning features. Now grab some and go build your own $1 billion model!