What makes the smart grid “smart”? Analytics, of course!

You’ve heard about the smart grid, but what is it that makes the grid smart? I’ve been working on a project with Duke Energy and NC State University doing time-series analysis on data from Phasor Measurement Units (PMUs) that illustrates the intelligence in the grid as well as an interesting application of analytical techniques. I presented some of our findings at the North American SynchroPhasor Initiative (NASPI) workgroup meeting in Houston recently, so I thought I’d share them with a broader audience.

PMUs in the power grid


Phasor Measurement Units (PMUs) measure the power transmission grid at much higher speed and fidelity than previous systems provided. PMUs measure the power frequency (i.e. 60 Hz), voltage, current, and phasor angle (i.e. where you are on the power sine wave). These units take readings 30 times per second, while previous systems took readings only every 3-4 seconds. This more frequent interval provides a much more detailed view of the power grid and allows detection of sub-second changes that were completely missed before.

Another great feature of the PMUs is their very accurate time measurement. PMUs are installed at points along the power grid miles apart from each other. For example, Duke Energy has over 100 PMUs installed across the Carolinas. To analyze data and learn about the whole grid, we need to synchronize the measurements taken at these locations. PMUs have Global Positioning System (GPS) receivers built in, not to determine location, but so that all units receive the same accurate time signal. Since GPS provides time accuracy in the nanosecond range, this is sufficient for our measurements at 30/second. This accuracy is most critical in the measurement of phasor angles. By comparing the phasor angles between locations, we get a measure of the power flow between the locations. Since the measurements are of something changing at a frequency of 60 Hz, the time stamp of the measurement must be of significantly higher precision than what is being measured.
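To see why comparing phasor angles yields power flow, consider the classic power-transfer equation for a lossless line, P = |V1||V2| sin(δ1 − δ2) / X. The sketch below is illustrative only; the function name and the example values are assumptions, not figures from the project:

```python
import math

def power_flow_mw(v1_kv, v2_kv, angle1_deg, angle2_deg, reactance_ohms):
    """Approximate real power transfer across an idealized lossless line:
    P = |V1| * |V2| * sin(delta1 - delta2) / X.
    With voltages in kV and reactance in ohms, the result is in MW."""
    delta = math.radians(angle1_deg - angle2_deg)
    return v1_kv * v2_kv * math.sin(delta) / reactance_ohms

# A 10-degree angle difference between two 500 kV buses over a 50-ohm line
# implies roughly 868 MW flowing between them.
print(round(power_flow_mw(500, 500, 10, 0, 50), 1))
```

A larger angle difference between two PMU locations means more power flowing between them, which is why time alignment far finer than one 60 Hz cycle matters.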

Working with this data has highlighted the similarities and differences between working with big data and high-speed streaming data. Big data is typically a large amount of data that has been captured and stored for analysis. Streaming data arrives constantly at a high rate and must be analyzed as it is received. One of the many interesting things about this project is that it involves both big data and streaming data.

So what have we learned working with this data? The main purpose of this project is to detect and understand events that affect the power grid, with the objective of keeping the grid stable. We have learned that different time-series techniques are needed for different aspects of the problem. The analysis flow breaks down into three areas: event detection (did something happen?), event identification (what happened?), and event quantification (how bad was it?).

For event detection, the task at hand is streaming data analysis. The system generates 30 measurements/second on hundreds of sensors and tags. Fortunately, the vast majority of the time (>99.99%) they indicate that no event of any kind is occurring. Since normal operation follows time-series patterns, those patterns can be modeled and used to detect deviations. These models let us make a very short-term forecast and then instantly flag an event of interest when incoming measurements deviate from it.
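The detect-by-forecast idea can be sketched generically. This is an illustration only, not the project's actual models; the class name, smoothing constants, and thresholds are all assumptions:

```python
class StreamingDetector:
    """Flag deviations from a one-step-ahead EWMA forecast.

    `level` is the short-term forecast of the signal; `scale` tracks the
    typical one-step absolute error. A reading whose error exceeds
    k * scale is flagged as an event and excluded from the baseline
    update, so the event itself does not contaminate the model of normal.
    """

    def __init__(self, alpha=0.1, k=5.0, warmup=50):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.level = None   # EWMA forecast of the signal
        self.scale = 0.0    # EWMA of |one-step forecast error|
        self.n = 0

    def update(self, x):
        self.n += 1
        if self.level is None:
            self.level = x
            return False
        err = abs(x - self.level)
        # during warm-up, never flag; just learn the normal pattern
        is_event = self.n > self.warmup and err > self.k * self.scale
        if not is_event:
            self.scale = (1 - self.alpha) * self.scale + self.alpha * err
            self.level = (1 - self.alpha) * self.level + self.alpha * x
        return is_event
```

Feeding it a steady 60 Hz-style signal with tiny fluctuations produces no flags; a sudden half-unit frequency dip is flagged on the very next reading, which is the "instantly detect" behavior described above.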

Event identification is the next order of business. An event of interest doesn’t necessarily mean there is a problem or that one will develop. Some events are random, like a lightning strike or a tree hitting a power line. Others represent some type of equipment failure. Many of these events produce a similar ‘signature’ in the data stream, and time-series similarity analysis and time-series clustering have been able to match incoming events to previously seen events. Knowing which previous event signatures are non-consequential allows us to safely ignore them.
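One simple way to illustrate signature matching (a generic nearest-neighbor sketch, not the project's actual similarity or clustering method; the function names are assumptions) is to z-normalize an incoming event window and compare it to a library of labeled signatures by Euclidean distance:

```python
import math

def znorm(seq):
    """Z-normalize so matching is invariant to offset and amplitude."""
    m = sum(seq) / len(seq)
    sd = math.sqrt(sum((v - m) ** 2 for v in seq) / len(seq)) or 1.0
    return [(v - m) / sd for v in seq]

def classify_event(window, library):
    """Return the label of the nearest stored signature and its distance."""
    w = znorm(window)
    best_label, best_dist = None, float("inf")
    for label, sig in library.items():
        s = znorm(sig)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, s)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```

Because of the normalization, a spike riding on a 5-unit baseline still matches a stored spike signature recorded at a different scale; an event whose nearest match is a known non-consequential signature can then be safely ignored.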

Finally we look at event quantification. For some events, the question is not just whether the event is occurring but also whether its magnitude gives cause for concern. An example is oscillation on the power grid. Small, diminishing oscillations are not necessarily a problem, but larger ones that are increasing may require further attention. Once the event type is identified, specialized techniques determine its magnitude and consequence.
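As one rough illustration of the growing-versus-diminishing distinction (a generic sketch, not the project's actual quantification technique), you can compare an oscillation's amplitude early and late in an observation window:

```python
import math

def oscillation_trend(window):
    """Compare RMS amplitude in the first and second halves of the window.
    A ratio above 1 means the oscillation is growing (cause for concern);
    below 1 means it is damping out."""
    mid = len(window) // 2

    def rms(seg):
        m = sum(seg) / len(seg)
        return math.sqrt(sum((v - m) ** 2 for v in seg) / len(seg))

    first, second = rms(window[:mid]), rms(window[mid:])
    return second / first if first else float("inf")
```

In practice one would fit the oscillation's damping ratio directly, but even this crude ratio separates a decaying ringdown from a growing oscillation that merits attention.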

This project has provided interesting insights into how to make the power grid smarter. Many of these techniques are also beneficial to streaming data analysis seen in other industries and applications. If there is a need to automatically identify and categorize system events based on data patterns, or filter out events that are non-consequential, then these techniques will be helpful.

Photo credits

PMUs in the power grid: Synchrophasor Technologies and their Deployment in the Recovery Act Smart Grid Programs, August 2013 report by the US Department of Energy

Firefighter image credit: photo by US Navy // attribution by creative commons

Hoover Dam image credit: photo by IAmSanjeevan // attribution by creative commons


Are there jobs for economists in analytics?

SAS will again be participating in the Allied Social Science Association annual meetings in January. This year the event will be held in Boston, and conference organizers expect more than 12,000 participants from a variety of backgrounds, including economics, finance and many other social sciences. One of the primary functions of the event, aside from traditional academic sessions, is that it serves as a single meeting place for employers and job candidates. Each year, approximately 1,000 candidates attend ASSA for the sole purpose of finding a job. As I’ve written before, corporate economists are hot again and a great source of analytical talent, so if you’re on the job market, consider exploring a career in industry.

I’ve worked in academia and industry, so I know both worlds. And in my current role I constantly talk with economists performing analytical functions in some of the world’s largest companies. What I have seen is that both worlds provide ample opportunity to use your economist skills. In fact, I would argue that a corporate analytics role will challenge your data skills in ways academia does not. You will be pushed to learn new methods, and as an economist you will be uniquely positioned to explain results to colleagues. There is, of course, no ‘free lunch,’ so your research will be guided by the firm’s revenue maximization or cost minimization priorities.

As you may know, SAS is one of the analytical computing platforms preferred by both business and government. In fact, Monster.com ranked it #1 on their list of “Job Skills that Lead to Bigger Paychecks.” For these reasons I was curious to conduct a little research about jobs using SAS. I chose to search the JOE website (Job Openings for Economists), the primary listing source for jobs in economics. I limited my search to positions listed as “Full-time nonacademic,” as universities are not likely to prefer one computing language over another. Of the 247 nonacademic listings, here is what I found for each program.

Search term    Number of jobs
SAS            41
Stata          34
Matlab         27
R              16
Python         11
SQL*            9

*While SQL is not a computational program, I would consider it a language.

SAS was explicitly listed in 41 job descriptions. Each number above is a hyperlink to the actual search, should you be interested in seeing the jobs available.

For those attending the ASSA conference, please stop by the SAS booth in the exhibit hall and say hello. And if you want to talk about the market for economists in industry or would like to meet some of this year’s job market candidates or industry representatives, please consider attending the SAS Academic and Industry Reception to be held on Sunday, January 4th at 5pm. If you are not attending ASSA but will be in the area on that date, we’d be glad to have you join us at this reception. Please click here to RSVP.

Don’t be fooled: there are really only two basic types of distributed processing

Every time I pick up a new article about analytics, I am always disappointed by the fact that I cannot find any specifics mentioned about back-end processing. It is no secret that every vendor wishes they had the latest and greatest parallel processing capabilities, but the truth is that many software vendors are still bound by single-threaded processing – as indicated by their obvious reticence about discussing details on the subject. As a result of using older approaches to data processing, most competitors will toss around terms like ‘in-memory’ and ‘distributed processing’ to sow confusion about how their stuff actually works. I will explain the difference in this post and tell you why you should care.

The truth is that there are really only two basic types of distributed processing: multi-processing (essentially grid-enabled networks) and pooled-memory massively parallel processing (MPP). Multi-processing consists of duplicate sets of instructions being sent to an array of interconnected processing nodes. In this configuration, each node has its own allocation of CPU, RAM, and data, and generally cannot communicate or share information with other nodes in the same array. While a large multi-step job can be chopped into pieces and each piece processed in parallel, the multi-processing configuration is still largely limited by the duplicate, single-threaded sets of instructions that need to run.

Contrast multi-processing with a pooled-memory architecture that has inter-node communication and does not require duplicate sets of instructions. Each node in a pooled-resource configuration can work on a different part of a problem, large or small. If any node needs more resources, data, or information from any of the other nodes, it can get what it needs by issuing messages to any of the other nodes. This makes for a truly ‘shared resources environment,’ and as a consequence it runs about ten times faster than the fastest multi-processing array configuration.

Now much of the confusion about these two types of distributed processing exists because of misuse of the term ‘in-memory’. The fact is that ALL data processing occurs in-memory at some point in the execution of a set of code instructions. So ‘in-memory’ is really a misnomer for distributed processing. For example, traditional SAS processing has always occurred in-memory as blocks of data are read from disk into RAM. As RAM allocations have gotten larger, more data has been loaded into memory, yet the instructions were still processed using a single-threaded, sequential approach. What was needed was a rewrite of the software to enable multi-threading, namely routing separate tasks to different processors. Combining a multi-threaded program with all data pre-loaded into memory produces phenomenally fast run-times compared to what was possible before.
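The task-routing idea behind multi-threading can be sketched generically in Python (an illustration only, not how SAS implements it; note that CPython threads share a global interpreter lock, so genuine CPU parallelism in Python requires processes, but the decomposition pattern is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # each worker runs the same task on its own slice of in-memory data
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Split the in-memory data into chunks, route each chunk to a
    separate worker, then combine the partial results."""
    step = (len(data) + workers - 1) // workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))
```

The single-threaded version would walk the whole list sequentially; here separate tasks run concurrently over pre-loaded data and the partial results are combined at the end, which is the essence of the multi-threaded, in-memory approach described above.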

Even though a program is multi-threaded, there is still no guarantee that things will run faster. An obvious example is Mahout, an Apache project that relied on MapReduce to facilitate inter-node communication in a pooled-resource environment. MapReduce is notoriously slow, as nodes take a long time to load data into memory and must write inter-node communication requests to disk before other nodes can access the request. As a consequence of its lethargic response time, Mahout has largely been abandoned by most large business customers in favor of faster architectures.

Message Passing Interface (MPI) is a much faster communication protocol, because it runs in-memory and it can accomplish multiple data iterations that are common to predictive analytics work. Currently there are only two MPI initiatives that offer true multi-threading, one based on Spark, an in-memory plug-in to Hadoop, and SAS’ High Performance Analytics. Spark development is still in its infancy, comparatively speaking, and it will likely be years before any push-button applications can make use of its capabilities. Alternatively, SAS has products that are production-ready today and can dramatically shorten your analytics lifecycle. So, do not be fooled by claims of in-memory or distributed processing, because MPI-enabled pooled-memory processing is here to stay and bodes well to become the de facto standard for all future predictive analytics processing.

For many standard analytics jobs, your standard architecture may be sufficient. But these phenomenally fast run times matter when you are trying to process dozens, if not hundreds, of tournaments that consist of the most advanced machine learning techniques like random forests and deep-learning neural networks. Statistical professionals are finding that these new techniques are not only more accurate, but they also allow us to investigate much lower levels of granularity than ever before. As a result, models are getting more precise and profitability is increasing concomitantly. So if you want to solve more problems faster and with more accuracy (with the same headcount), be sure to investigate claims of “in-memory” and choose the right architecture for your job.

Econometric reflections from Analytics 2014

This post will violate the “what happens in Vegas stays in Vegas” rule, because last week I had the pleasure of attending and participating in the Analytics 2014 event there and want to share some of what I heard for those who couldn’t attend. I was joined by over 1,000 attendees and colleagues as we gathered to share best practices in the fields of statistics, econometrics, forecasting, text analytics, optimization, data mining, and more. My talk, demo theatre presentation, and exhibit hall hours gave me the opportunity to meet many interesting people, so here are some of my thoughts based on what I heard and learned.

  • Jan Chvosta and I presented “Why Econometrics Should Be in Your Analytics Toolkit: Applications of Causal Inference” (presentation available here) to approximately 100 attendees and sincerely appreciated the feedback we received. Of note to me was just how many audience members approached us afterward and said that “causal interpretation” is what they strive for with their predictive modeling. From marketing mix models to CCAR stress testing to price elasticity estimates, I saw many nodding heads when we talked about the importance of interpretation in these models. To twist the words of Nobel Laureate Robert Lucas, “once you start thinking about causality, it is hard to think about anything else.” It appears to me that there are still many people interested in the meaning of models in this world of “big data” and “machine learning.”
  • I was able to attend Michele Trovero’s talk on using SAS/ETS® tools to estimate linear and non-linear state-space models. Michele showed several examples and benefits of the existing SSM procedure and gave a look ahead to the upcoming ESMX procedure. Once released, this procedure will allow new exponential smoothing models to be estimated in SAS and provide a statistical treatment of structural time series models based on exponential smoothing. Use this link to find his slides or listen to this interview to hear Michele describe some of these tools.
  • At the econometrics booth I was asked about many different topics. We talked about state-space models (PROC SSM), time series data management with PROC TIMEDATA, and the use of multinomial discrete choice models for price elasticity measurement (PROC MDC). However, far and away the dominant topic was the new CCAR compliance regulations. For those unfamiliar with these regulations, they require that bank stress tests incorporate macroeconomic scenarios into forecasts of financial performance, with the goal of ensuring bank solvency under adverse economic events. One of the difficulties of this problem is that the objective of the analysis is no longer strictly a predictive modeling problem. New policies dictate that certain variables must be included and that their effects behave consistently with economic theory. No fewer than ten banks independently asked about modeling techniques to satisfy these regulations. It was quite interesting, because each bank had a different method of solving the problem. Some banks with access to micro data chose to model probabilities of default for each asset. Other banks without access to this data have opted for a time-series-based approach. For some time now, I have been working on formalizing these methods, and I will present a paper on the subject at both the Conference on Statistical Practice in February and SAS Global Forum in April.
  • It was my pleasure to lead a roundtable discussion about statistical and econometric modeling in health insurance. It was a packed table and I apologize if you were turned away. Many of the topics discussed during our causality talk were echoed during the roundtable, most notably non-random assignment of certain interventions. In fact, one large health insurer spoke about the early returns of the Affordable Care Act with respect to substitution effects, or lack thereof, from emergency room usage to traditional clinics. This may suggest that the population now covered by new rules remains unlikely to shift their healthcare usage from high-cost emergency room visits to less costly outpatient facilities.
  • Finally I would like to thank the 15 or so attendees at my 7:30am demo theatre presentation on the new items in SAS/ETS®. These were brave and dedicated souls to be such early risers in Vegas! I always enjoy the chance to evangelize our new tools. We spent a great deal of time talking about PROC HPCDM, a tool for simulating aggregate loss distributions in insurance and banking. People were interested because there now is a very computationally efficient way to simulate aggregate losses subject to business rules about deductibles and limits. We also talked about new methods of estimating limited dependent models in the presence of endogenous regressors. There was an interesting question about tools for spatial econometric models, which isn't part of the current portfolio but will definitely be part of future presentations.

Ken Sanford being interviewed on the demo floor at Analytics 2014 (click on the photo for the interview)

Unicorn hunting: finding data scientists outside traditional academic disciplines

Finding people with the range of skills classified as data science can be a challenge, which is why some call them unicorns (do they really exist?), so I recently posted ten tips on finding unicorns. In my first post I elaborated on tips 1 and 2 (1. hire from an MS in Analytics program and 2. hire from a great program you've never heard of). In this post I'll share two more tips, which entail hunting for data scientists beyond the math, stats, computer science, operations research, and engineering departments where you might most expect to find this kind of talent.

3. Recruit from untraditional disciplines

Wayne Thompson, PhD in Plant Sciences from the University of Tennessee, during his year spent as a visiting scientist at the Institut superieur d'agriculture de Lille in France.


As this article from Inc. points out, computer science may not be the best place to find data scientists. In fact, the article refers to a survey of data scientists, 51% of whom said the best data scientists come from outside computer science. For that matter, if you limit yourself to other, perhaps more “traditional,” analytical disciplines you may be overlooking some great candidates. Take Wayne Thompson, the Chief Data Scientist at SAS, who studied plant sciences but minored in statistics. His path through agricultural sciences is natural for many of our colleagues, since SAS was founded by a consortium of land-grant universities heavily funded by grants from the United States Department of Agriculture. Over the years many of our senior executives have had degrees in agriculture-related disciplines like forestry, agricultural economics, etc.

Juthika Khargaria, Ph.D. in Astrophysical and Planetary Sciences from University of Colorado, standing next to the 18” telescope at Sommers Bausch Observatory.

Consider Juthika, an analytics solutions architect who assists customers in defining their business problems and demonstrates how SAS advanced analytics solutions could help. Before joining SAS, Juthika studied astrophysics, working with complex systems, statistics, and abstract concepts. Juthika says that the data astrophysicists deal with is high in noise but low in signal, so they are experienced in methods for teasing out that signal. See how well Juthika bridges that gap in this blog post she wrote on using wavelets to separate the signal from the noise. Physicists also usually have strong computational skills, which is why we have hired several in Advanced Analytics R&D to develop our software.

4. Look beyond STEM departments to recruit from the social sciences

There are plenty of good reasons to recruit from the STEM (science, technology, engineering, and math) disciplines, since these fields provide their students an excellent foundation for analytical problem-solving. But there are good reasons to look to the social sciences as well. Phil Weiss is an analytical consultant who helps customers understand how our advanced analytics software might solve their business problems. He shared, “The value of a liberal arts degree cannot be overstated when it comes to being able to more easily handle difficult conceptual problems and the multifarious nature of symbolic systems, especially programming languages….My statistics training and close association with ‘big data’ derived from depositional patterns allowed me to transition into computer science even though I had limited training in any STEM field.” In fact, as this article from Fast Company shows, many tech CEOs even prefer to hire from the liberal arts, arguing that these disciplines train students to “thrive in ambiguity and subjectivity,” which are hallmarks of any real business environment.

Phil Weiss (on left), ABD in Archaeology, Arizona State University on a dig while in grad school.


Most PhD programs in the social sciences require their students to take courses in the quantitative methods necessary to do data-driven research, so they may even have a more substantial foundation than you’d expect. The School of Social Welfare at the University of Wisconsin-Milwaukee even offers a Graduate Certificate in Applied Data Analysis Using SAS.  Significant research on statistical theory and quantitative methods is being done in colleges of education. There is an emerging field of computational journalism. And my colleague Ken Sanford has written and spoken extensively about why economists make great data scientists. So wander across campus to different buildings on your recruiting trips, and you may be delighted and surprised at what you find.

Next time:

5. Try before you buy - create an intern program

6. Sponsor foreign nationals


Missing unicorns - 10 tips on finding data scientists (Part 1)

As this article on the mythical data scientist describes, many people call this special kind of analytical talent "unicorns," because the breed can be so hard to find. In order to close the analytical talent gap that McKinsey Global Institute and others have predicted, and many of you experience today, SAS launched SAS Analytics U in March of this year to feed the pipeline for analytical talent. This higher education initiative aims to help address the skills gap by offering free versions of SAS software, university partnerships, and more. Yes, I did say free, and the free SAS® University Edition even runs on Mac, in addition to PC and Linux! Meanwhile, since data scientists can be hard to find, I’ll share with you ten tips to use in your hunt, illustrated with examples from some of our own legendary unicorns at SAS.

Five of the tips relate to academic recruiting:

1.  Hire from an MS in Analytics program
2.  Hire from a great program you’ve never heard of
3.  Recruit from untraditional disciplines
4.  Look beyond STEM - recruit from social sciences
5.  Try before you buy – create an intern program


Five more relate to other best practices:

6.  Invest in sponsorship for foreign nationals
7.  Use social networks to hire friends of unicorns
8.  Hire the curious who want to solve problems
9.  Think about what kind of data scientist you need
10.  Don’t expect unicorns to grow horns overnight


Each tip is worth expansion, so I'll share two in this post and more in subsequent posts.


Patrick Hall, MS in Analytics, NC State University

1. Hire from an MS in Analytics program

SAS proudly helped launch and continues to support the Institute for Advanced Analytics at NC State University, led by Dr. Michael Rappa, which is the granddaddy of them all, for good reason. Over 90% of their graduates get offers by graduation, because in their intensive 10-month program they receive not only an outstanding academic foundation but targeted attention to those “softer” skills like public speaking, team work, business problem identification and formulation, etc. that are so essential to the practice of analytics. Patrick Hall, pictured here while getting his MS in Analytics from this program, is one of our machine learning experts on the SAS® Enterprise Miner™ R&D team and even a certified data scientist, being one of the few to pass the rigorous Cloudera Certified Professional: Data Scientist (CCP:DS) exam. SAS works with scores of the exploding number of these programs and they can be great places to recruit graduates with training in analytics and experience using SAS software.



Dr. C (far left), Murali Pagolu (fifth from left) and Satish Garla (sixth from left), both MS in Management of Information Systems/Analytics, Oklahoma State University

2. Hire from a great program you’ve never heard of

In addition to the many well-known programs, there are some great ones that you might not have heard of, like one at Oklahoma State University (OSU) run by Dr. Goutam Chakraborty (just call him Dr. C.), who has graduated 700+ unicorns in the last decade. Designed to recognize students with advanced knowledge of SAS, these joint certificate programs supported by the SAS Global Academic Program require students to complete a minimum of credit hours in relevant courses. Murali Pagolu and Satish Garla both received an MS in Management Information Systems/Analytics from this program and are pictured here when they were winners in the 2011 SAS Analytics Shootout, held annually at our Analytics Conference. Murali and Satish work in our Professional Services Division, helping customers implement their software and get their analytical models in place. They are just two of the many OSU graduates who have won countless awards. An executive at a large Midwestern manufacturer recently told me that he had to persuade his Human Resources Department to send a recruiting team to Stillwater, Oklahoma, but it paid off – they found two of their own unicorns there. Or convince HR to pay a visit to Kennesaw, Georgia, to visit Dr. Jennifer Priestley’s program run out of the statistics department at Kennesaw State University, which was recently cited by Computerworld as having the most innovative academic program in Big Data Analytics. There are many more programs like these around the country where you can recruit, so don't limit yourself to universities with which you are familiar.

I'll explain more tips and show more unicorns in future posts, but if you're attending the Analytics 2014 conference in Las Vegas October 20-21, there will be a virtual herd of SAS unicorns galloping around! I'll be giving a demo theater presentation on analytical talent where I'll give all ten of my tips for finding them. Stop me and say hi if you’re there – I always like meeting unicorns and could introduce you to many others we have there. Many others mentioned in this post are on the great conference agenda and will be presenting:

  • Dr. Michael Rappa, who leads the Institute of Advanced Analytics at NC State University, will give a keynote session on "Solving the Analytics Talent Gap."
  • Patrick Hall, SAS unicorn and one of Rappa's former students, will give a presentation on "An Overview of Machine Learning with SAS® Enterprise Miner™" and a super demo on "R integration in SAS® Enterprise Miner™."
  • Murali Pagolu, SAS unicorn and OSU graduate will present with his former professor, Dr. C, on "Unstructured Data Analysis: Real World Applications and Business Case Studies." Dr. C will bring 35 of his current and former students to the conference and has two teams who are finalists and another Honorable Mention in the 2014 Analytics Shootout.
  • Dr. Jennifer Priestley of Kennesaw State University will talk about "What You Don't Know About Education and Data Science."
  • Stop by the Networking Hall to visit booths on SAS Analytics U, the Global Academic Program, and programs from NCSU, OSU, and Kennesaw State University, as well as many other academic sponsors who run great programs you should add to your recruiting list.


(Unicorn poster image credit: photo by Arvind Grover // attribution by creative commons. Other photos courtesy of the unicorn pictured)

How discrete-event simulation can help project prison populations

In 2011, the passage of the federal Justice Reinvestment Act (JRA) brought significant changes to North Carolina’s criminal sentencing practices, particularly in relation to the supervision of offenders released into the community on probation or post-release supervision. A recent New York Times article highlighted how NC has used the JRA to implement cost-saving strategies. Each year the NC Sentencing Commission prepares prison projections that are used by the state Department of Public Safety and the NC General Assembly to help determine correctional resource needs for adult offenders, but the changes resulting from the JRA put a huge kink in the long-established process used to generate those projections. Discrete-event simulation software from SAS helped smooth out those kinks.

The NC Sentencing Commission had been using a simulation model written in C-based code to project NC’s prison population for more than twenty years. The changes imposed by the JRA required new functionality not available in the existing simulation. The Administrative Office of the Courts (AOC) ultimately contracted with SAS to develop a more flexible and transparent prison population projection model using discrete-event simulation.

Traditional time series methods are ineffective for prison population projections because of dynamic factors like sentence length, prior criminal history, revocations of community supervision, and legislative changes. As an alternative, the SAS Advanced Analytics and Optimization Services Group (AAOS) used SAS® Simulation Studio to build a discrete-event simulation model that approximates the journeys of offenders through the criminal justice system. In general, discrete-event simulation is used to model systems where the state of the model is dynamic and changes in the state (called events) occur only at countable, distinct points in time. Examples of events in the prison model include an arrival of an offender in prison or a probation violation.

Process flowcharts provided the framework for the Simulation Studio prison projection model (for those interested in more detail, these flowcharts can be found in a more extended paper on this model presented at SAS Global Forum in 2013). Even though most of the JRA provisions went into effect on December 1, 2011, there will be a period of time for which portions of the JRA do not apply to certain offenders. As a result, the new simulation model incorporates both pre-existing and new legislative policies.

The AAOS Group in R&D translated the logic contained in the flowcharts into a Simulation Studio model. The entities (or objects) that flow through the model represent criminal cases. They have attributes (or properties) such as case number, gender, race, age, and prison term. At simulation execution, case entities are generated and routed according to their attributes over a ten-year period. For example, when it is time for a case entity to be released from prison, a check is done to see if that entity qualifies for post-release supervision. If so, then the entity is routed to logic that samples a random number stream to determine whether or not that entity will commit a violation at some point in the future.
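The release-time routing described above might look like the following sketch. The attribute names, offense classes, and probabilities are all hypothetical, invented purely to illustrate the pattern of checking eligibility and then sampling a random stream.

```python
import random

random.seed(7)

# Hypothetical attributes on a case entity; the real model carries many more.
case = {"case_number": 1001, "offense_class": "F", "prison_term_months": 24}

# Assumed parameters for illustration only.
P_SUPERVISION = {"F": 1.0, "I": 0.0}  # probability of post-release supervision by class
P_VIOLATION = 0.35                    # probability of a future violation

def route_on_release(entity, rng=random):
    """Route a case entity at its release event: check post-release
    supervision eligibility, then sample a random stream to decide
    whether a violation occurs at some point in the future."""
    if rng.random() < P_SUPERVISION.get(entity["offense_class"], 0.0):
        entity["supervised"] = True
        entity["violates"] = rng.random() < P_VIOLATION
    else:
        entity["supervised"] = False
        entity["violates"] = False
    return entity

routed = route_on_release(dict(case))
print(routed["supervised"], routed["violates"])
```

In the actual model, a sampled violation would schedule further events (a return to prison, a new release date, and so on) rather than just setting a flag.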

The inputs to the Simulation Studio model are in the form of SAS data sets and include the following:

  1. Stock data, provided by the Department of Public Safety’s Division of Adult Correction: includes inmates in prison at the beginning of the projection period and their projected release date.
  2. Court data, provided by the AOC: contains convictions and sentences imposed in the most recent fiscal year.
  3. Growth estimates: projected growth rate for convictions as determined by the Sentencing Commission’s forecasting advisory group after examining demographic trends, crime trends, and arrest trends.

The court and stock data include both individual-level information (such as demographics, offense, and sentence) as well as aggregate-level information (such as the probability of receiving an active sentence by offense class and the lag-time between placement on probation and a return to prison for a violation).

At the end of the simulation, two SAS data sets are generated, providing a complete history of prison admissions and releases over a ten-year time period. From this data, monthly and annual projections can be prepared at an aggregate level as well as by variables of interest such as gender, race, age, and offense class.
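The aggregation step is straightforward: count admissions and releases per period, then accumulate the net flow into a population projection. A tiny sketch with invented event data:

```python
from collections import Counter

# Hypothetical event history, (month_index, "admission" | "release") pairs,
# standing in for the two output data sets described above.
events = [(0, "admission"), (0, "admission"), (1, "release"),
          (1, "admission"), (2, "release"), (2, "admission")]

admissions = Counter(m for m, kind in events if kind == "admission")
releases   = Counter(m for m, kind in events if kind == "release")

# Monthly population projection = running net flow of admissions - releases.
population, projection = 0, {}
for month in range(3):
    population += admissions[month] - releases[month]
    projection[month] = population

print(projection)  # → {0: 2, 1: 2, 2: 2}
```

Grouping the same counts by gender, race, age, or offense class gives the by-variable projections mentioned above.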

After the AAOS group finished building the simulation model, it was handed over to the Sentencing Commission, along with documentation and training for the Simulation Studio modeling interface, so that the Sentencing Commission could then run the model and make changes as needed. They have used the model to prepare projections for two years now, with the first official results published in February of 2013. Figure 1 shows the projected prison population and capacity for FY 2014 through FY 2023. The prison population is projected to increase from 37,679 in June 2014 to 38,812 in June 2023, an increase of 3%. A comparison of the projections with the operating capacity indicates that the projected prison population will be below prison capacity for the ten-year projection period. In June 2014 the actual average prison population was 37,731, so the model-projected population of 37,679 was within 0.2% of the actual. The current projections, as well as projections from previous years, are located on the Sentencing Commission’s website.

This project demonstrates a very promising application of discrete-event simulation in practice. The resulting Simulation Studio model not only incorporates changes to correctional policies as a result of the JRA, but it can easily be modified by the Sentencing Commission to incorporate any future legislative acts that affect the prison process, providing the state the flexibility and transparency they desired. The model can also be extended to project other criminal justice populations (such as juveniles) in both NC and in other states.

Building a $1 billion machine learning model

At the KDD conference this week I heard a great invited presentation called How to Create a $1 billion Model in 20 days: Predictive Modeling in the Real World – A Sprint Case Study. It was presented by Tracey de Poalo from Sprint and former Kaggle President and well-known machine learning expert Jeremy Howard (@jeremyphoward). Jeremy convinced Sprint’s CEO that machine learning could help their business, so he was brought on as a consultant to work with Tracey and her team. The result was the $1 billion model, which he called the highest-value machine learning case he’s ever seen.

Jeremy had the executive blessing they needed to get access to key teams, so they conducted 40-50 interviews to identify which business problems to prioritize for their work. Based on these interviews they decided to prototype models for churn, application credit, behavioral credit, and cross-sell. When ready to tackle the data, Jeremy was impressed that they were ahead of the curve. Tracey’s team had already built a data mart of 10,000 features on each customer. Jeremy said their thorough and well-organized data dictionary was the best he’d seen in his career.

For a planned benchmarking exercise, Jeremy chose his favorite Kaggle-winning scripts built on the R packages caret and randomForest. Based on his past Kaggle success, he felt confident he’d beat her existing models. When the results were in, he confessed he was shocked that his were almost the same as hers, which were based on logistic regression. Kudos to Jeremy for his refreshing honesty, as someone commented during the Q&A.

Tracey’s team’s process was rigorous and completely automated: 1) missing value imputation; 2) outlier treatment; 3) variable reduction (cutting the variable count by ~65%); 4) transformations; 5) VIF (limited to 10); 6) stepwise regression (down to ~1,000 variables); 7) model refitting (50-75 variables left). Jeremy was most amazed by Tracey's strategic use of variable clustering, commenting that it is an interesting approach he hadn’t seen elsewhere. She ranked her variables by R² and then picked one variable per cluster.
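The pick-one-per-cluster trick can be sketched in a few lines. Everything below is invented for illustration: the feature names, the data, and the cluster assignments, which in practice would come from a variable-clustering step such as PROC VARCLUS rather than being hard-coded.

```python
import random
random.seed(1)

# Toy data: two clusters of correlated predictors (names are hypothetical).
n = 200
base1 = [random.gauss(0, 1) for _ in range(n)]
base2 = [random.gauss(0, 1) for _ in range(n)]
X = {
    "minutes_used": base1,
    "calls_made":   [b + random.gauss(0, 0.1) for b in base1],  # ~ minutes_used
    "data_mb":      base2,
    "sessions":     [b + random.gauss(0, 0.5) for b in base2],  # ~ data_mb
}
y = [0.8 * b1 + 0.2 * b2 + random.gauss(0, 0.3) for b1, b2 in zip(base1, base2)]

def r2_simple(x, y):
    """R^2 of a one-variable linear fit = squared Pearson correlation."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Assumed cluster assignments; keep the best single representative of each.
clusters = [["minutes_used", "calls_made"], ["data_mb", "sessions"]]
selected = [max(cluster, key=lambda v: r2_simple(X[v], y)) for cluster in clusters]
print(selected)
```

The payoff is that highly correlated variables never enter the model together, which tames multicollinearity before the stepwise stage even starts.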

As a result of their work together, their new model identified nine variables that explained the majority of bad debt. Combining these factors with customer credit data, they were able to estimate customer lifetime value, which allowed them to quantify the cost of making a bad call on credit. Adding these costs up, you reach $1 billion in value.

A history of machine learning in SAS

What I love about the machine learning model Tracey's team had in place is that it has its roots in a very early SAS procedure, VARCLUS, which goes back to at least the early 1980’s. As I wrote before, machine learning is not new territory for SAS. SAS implemented a k-means clustering algorithm in 1982 (as described in this paper on PROC FASTCLUS in SAS/STAT®), but after reading my post Warren Sarle pointed out that PROC DISCRIM did k-nearest-neighbor discriminant analysis at least as far back as SAS 79. That early procedure was written by a certain J. H. Goodnight, who some may recognize as SAS founder and CEO.
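For readers who haven't seen it, k-means (the algorithm FASTCLUS implements) is simple enough to sketch in plain Python. Here it is for one-dimensional points and k=2, with made-up data:

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal k-means: alternate assignment and update steps."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # → approximately [1.0, 9.0]
```

The two steps repeat until the centers stop moving; on this toy data they converge after a single pass.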


A neural learning technique called the perceptron algorithm was developed as far back as 1958. But neural network research made slow progress until the early 1990’s, when the intersection of computer science and statistics reignited the popularity of these ideas. In Warren Sarle’s 1994 paper Neural Networks and Statistical Models (where I found the illustration to the left), he even says that “the most commonly used artificial neural networks, called multilayer perceptrons, are nothing more than non-linear regression and discriminant models that can be implemented with standard statistical software.” He then explains that he will translate “neural network jargon into statistical jargon.”
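The original perceptron learning rule is tiny, which makes Sarle's point nicely: it is just an error-driven update to a linear decision boundary. A sketch, trained on the logical AND function:

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """Classic perceptron rule: update weights only on mistakes,
    nudging the decision boundary toward the misclassified target."""
    w = [0.0, 0.0]  # weights
    b = 0.0         # bias
    for _ in range(epochs):
        for x, target in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - pred
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
print([predict(x) for x, _ in AND])  # → [0, 0, 0, 1]
```

Stack layers of these units with nonlinear activations and you have the multilayer perceptron, which, as Sarle observed, is a nonlinear regression model by another name.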

Flash forward to today, where this article from Forbes reports that the most popular course at Stanford is one on machine learning. It is popular once again, and the discussions and papers at KDD this week certainly reflected this trend. While machine learning is nothing new for SAS, there is a lot of new machine learning in SAS. You can read more on machine learning in SAS® Enterprise Miner in this paper and in SAS® Text Miner in this paper, to name just a few of our products with machine learning features. Now grab some and go build your own $1 billion model!

Why corporate economists are hot again and a great source for analytical talent

A while back The Wall Street Journal published the article “Corporate Economists Are Hot Again,” which chronicles the resurgence of in-house economists in corporate America. The role of a corporate economist may bring to mind classic economist stereotypes (watch Ben Stein play to this stereotype as a teacher in the great 1986 movie Ferris Bueller's Day Off - search for "anyone, anyone" and the movie title for a good laugh). These prognosticators were popular in the 1970’s and 1980’s as companies attempted to turn the volatile macroeconomic environment into a competitive advantage. The subsequent near-twenty-year economic expansion and decreasingly volatile economy reduced the need for full-time economists, since the future continued to appear near-certain. Recently, economists are being hired again, but this time for a completely different reason, one that I have been evangelizing since my start at SAS: economists are a great source of analytical talent. They have all the necessary skills, which is why many companies are hiring them into these roles. Economists are poised to break into data science roles for these five reasons:

  1. We understand objective functions: Economists love objective functions, since they dictate how the players in a system behave. This can be important both in predicting outcomes and in conducting analysis. If the objective is to understand how price affects quantity, variable selection mechanisms cannot be used, because they would eliminate the price variable.
  2. Economists have a very strong linear regression toolkit: While economists often do not have the depth of statistical methods that a formally-trained statistician has (we miss out on clustering and variable reduction, to name a few), we know what we know with great depth. And fortunately, very few problems require more than linear regression. There is one subtle tweak to an economist’s regression toolkit, which is….
  3. We own observational data and causality:  Economists never assume we have the luxury of experimental data. We always assume that the data are rife with issues such as measurement error, censoring and sample selection. For these reasons, economists have tweaked their regression training to address all these problems. Nearly all the corporate customers of SAS I have met model data generated outside a lab. The data are collected retroactively and have all the problems listed above and more.
  4. Articulating the problem and the solution: This reason is closely tied to the first point. Economists can talk about the problem and explain the solution. I have heard my fellow economists call this trait “storytelling” (hat tip to John Moreau). I think that term perfectly describes our skills here. SAS customers often tell me that they like the way economists conduct regression, because they look at the coefficients to verify they align with theory. Part of the storytelling proficiency is skill at explaining what incentives led to a given response. Other disciplines tend to focus on statistical fit rather than explanation.
  5. We work with big data: While this might not be immediately obvious, economists are very skilled at dealing with data that are uncomfortably large. Nearly every labor or health economics course requires a data replication project involving multiple years of the US Census Bureau’s Current Population Survey or its 5-percent Public Use Microdata Sample (PUMS). These datasets are easily multiple gigabytes in size and require programming efficiency to process.
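Points 1 through 3 come together in the classic demand regression: keep price in the model so its coefficient can be read off directly as an elasticity. A toy sketch on invented demand data (the true elasticity here is set to -2.0 by construction):

```python
import math
import random
random.seed(3)

# Invented demand data with an assumed true price elasticity of -2.0.
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
quantities = [100.0 * p ** -2.0 * math.exp(random.gauss(0, 0.05)) for p in prices]

# Log-log regression: log(q) = a + b*log(p), so the slope b IS the elasticity.
x = [math.log(p) for p in prices]
y = [math.log(q) for q in quantities]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
a = my - b * mx
print(f"estimated elasticity: {b:.2f}")  # close to the assumed -2.0
```

A variable-selection routine that dropped price for fit reasons would make this question unanswerable, which is exactly the economist's point about objective functions.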

In fact, perhaps one of the most famous advocates of the “economist as data scientist” argument is Hal Varian. While his comment about statisticians being sexy is far better known, he is an economist himself, and the full quote sums it up best:

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves.” –Hal Varian, Chief Economist, Google[1]

Too bad he didn't call economists sexy.

So what holds economists back? I have my theories. I believe there are three key areas we must address: 1) terminology, 2) methodology and 3) technology. I will elaborate on these during my upcoming talk at the National Association for Business Economics Annual Meeting in Chicago September 27-30. If you find yourself in the area, I hope you can attend.

Looking backwards, looking forwards: SAS, data mining, and machine learning

Looking forward, ten of my SAS colleagues and I are heading to New York City this weekend for KDD 2014: Data Science for the Social Good, which runs August 24-27. This event’s full name is the 20th Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, but it is more commonly known as ACM SIGKDD, or just KDD for short.

Looking backwards, the first KDD workshop was held in 1989, and these workshops eventually grew into this series of conferences. Whether you still call it data mining, or prefer machine learning or data science, the fact that this year’s conference is sold out, with 2,200 registrants exceeding all expectations, is a sign of how hot the topic has become. KDD’s tagline today is “bringing together the data mining, data science, and analytics community,” so this nexus is right where SAS has played for years. In fact, the picture below is taken from a data mining primer course SAS offered in 1998.

data mining Venn diagram

The SAS story starts with the statistics circle above, when the language was first developed in 1966, multiple regression and ANOVA were added in 1968, the first licenses sold in 1972, and the company incorporated in 1976. SAS moved into the data mining and machine learning circle early, when in 1982 the FASTCLUS procedure implemented k-means clustering. But while there’s more to this history, I’ll save it for another post and return to a forward-looking view.

I’m looking forward to hearing a keynote on Sunday night by Pedro Domingos (Department of Computer Science and Engineering at the University of Washington), who is the 2014 winner of the ACM SIGKDD Innovation Award and will be giving the talk associated with that award at the conference. I found his paper A Few Useful Things to Know about Machine Learning to be an excellent resource. On Monday morning Oren Etzioni (Executive Director of the Allen Institute for Artificial Intelligence, from the same department at the University of Washington) will give a talk on “The Battle for the Future of Data Mining,” which certainly will inform my forward-looking view. It will be interesting to hear where he thinks the field is heading, and where the battles will lie.

On Monday morning, right after we’ve heard Dr. Etzioni look to the future, my own colleague Zheng Zhao will present a paper he co-authored with our fellow SAS peers James Cox and Jun Liu, “Safe and Efficient Screening For Sparse Support Vector Machine,” in the Feature Selection Research Track. The emergence of big-data analysis poses new challenges for model selection with large-scale data consisting of tens of millions of samples and features. The paper proposes a novel screening technique to accelerate model selection for support vector machines (SVMs) and effectively improve their scalability: it precisely identifies inactive features in the optimal solution of an SVM model and removes them before training. Experimental results on five high-dimensional benchmark data sets demonstrate the power of the proposed technique.
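To give a feel for what "screening" means in general, here is a naive sketch of the concept on synthetic data. To be clear, this is not the safe-screening algorithm from the paper, which comes with exactness guarantees; it just shows the idea of scoring features cheaply and dropping unlikely ones before the expensive training step.

```python
import random
random.seed(0)

# Synthetic data: 300 samples, 50 features, labels depend only on
# features 0 and 1; the other 48 are pure noise.
n, d = 300, 50
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [1 if row[0] + 0.5 * row[1] > 0 else -1 for row in X]

def score(j):
    """|correlation| between feature j and the label: a cheap relevance proxy."""
    col = [row[j] for row in X]
    mc, my = sum(col) / n, sum(y) / n
    num = sum((c - mc) * (t - my) for c, t in zip(col, y))
    den = (sum((c - mc) ** 2 for c in col) *
           sum((t - my) ** 2 for t in y)) ** 0.5
    return abs(num / den)

threshold = 0.2  # assumed cutoff for this illustration
kept = [j for j in range(d) if score(j) > threshold]
print("kept features:", kept)
```

On this data, the screen keeps the two informative features and discards nearly everything else, so whatever training follows works on a much smaller problem. The paper's contribution is doing this safely, without ever discarding a feature that would have been active in the optimal SVM solution.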

SAS will be in the exhibit hall with a booth (#14). In addition to talking about the products SAS offers for machine learning, we will be talking about our new SAS Analytics U initiative, which includes SAS® University Edition, a free, downloadable version of select SAS statistical software that runs on PCs, Macs, and Linux and is designed for teaching and learning SAS. We'll also be giving away some copies of our colleague Jared Dean's new book, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. In the booth on Monday and Tuesday we will also offer what we call superdemos, which are 15-minute long demos on focused topics. Here is the list:

Monday, August 25, 10:00-10:15 a.m.
Deep learning for dimensionality reduction/visualization
Jorge Silva
We will showcase deep learning with PROC NEURAL, using a deep auto-encoder architecture to visualize clustering results on medical provider data. 
Monday, August 25, 1:00-1:15 p.m.
Contextual Recommendation using Text Analysis
Yue Qi
The collaborative filtering-based recommender is prone to the cold start problem and long tail problem, so this demo will show how to derive contextual recommendations using text analysis to address both problems.
Monday, August 25, 3:00-3:15 p.m.
Time series dimension reduction for data mining using SAS
Catherine Lopes
This demo introduces SAS procedures for time series dimension reduction in data mining.
Monday, August 25, 5:00-5:15 p.m.
New techniques for doing association classification and a demonstration of their usefulness for mining text
Jim Cox
We will describe two new algorithms for pattern discovery with a single consequent or external category: Bool-yer and AssoCat.
Tuesday, August 26, 10:00-10:15 a.m.
R integration node
Jorge Silva
This demo will illustrate the diagram and workflow user interface and also focus on how people can try their favorite R algorithms while taking advantage of data handling and pre-processing capabilities built into SAS® Enterprise Miner.
Tuesday, August 26, 1:00-1:15 p.m.
Classification Using Bayesian Networks in SAS® Enterprise Miner
Weihua Shi
Using a newly developed high-performance Bayesian network procedure (PROC HPBNET), this demo will illustrate the graphical-modeling approach using real-world data.
Tuesday, August 26, 3:00-3:15 p.m.
Interactive Stratified Modeling using SAS® Visual Statistics
Wayne Thompson
This demo will show how to develop stratified models based on group-by variables, decision trees to derive segments and enforce business rules, and clustering demographic data followed by supervised models using transactional data.

If you are already planning on attending KDD, come by booth #14 and see us. If you didn’t register in advance you’re probably out of luck, since the conference is sold out. But I plan to blog again after the conference and will offer some impressions from the event, as well as sharing some more history about SAS, data mining, and machine learning, continuing with my backward and forward looks.