Three final talent tips: how to hire data scientists

How do we hire data scientists at SAS, since we are not unique in our search for a rare talent type that continues to be in high demand? This post is the last in a series on finding data scientists, based on best practices at SAS and illustrated with some of our own “unicorns.” You can read my first blog post on calling them unicorns and for tips 1 and 2 on finding them in an MS in Analytics Program or from a great program you may not have heard of. You can read tips 3 and 4 on how to find this kind of talent outside the traditional STEM academic disciplines. And tips 5, 6, and 7 detail the value we’ve found in intern programs, social networks, and sponsorship of foreign nationals.

This last post focuses on less tangible aspects, related to curiosity, clarity about what kind of data scientist you need, and having appropriate expectations when you hire.

8. Look for people with curiosity and a desire to solve problems

Radhika Kulkarni, PhD in Operations Research, Cornell University, teaching calculus as a grad student.

As I blogged previously, Greta Roberts of Talent Analytics will tell you that the top traits to look for when hiring analytical talent are curiosity, creativity, and discipline, based on a study her organization did of data scientists. It is important to discover if your candidates have these traits, because they are necessary elements to find a practical solution and separate candidates from those who may get lost in theory. My boss Radhika Kulkarni, the VP of Advanced Analytics R&D at SAS, self-identified this pattern when she arrived at Cornell to pursue a PhD in math. This realization prompted her to switch to operations research, which she felt would allow her to pursue investigating practical solutions to problems, which she preferred to more theoretical research.

That passion continues today, as you can hear Radhika describe in this video on moving the world with advanced analytics. She says “We are not creating algorithms in an ivory tower and throwing it over the fence and expecting that somebody will use it someday. We actually want to build these methods, these new procedures and functionality to solve our customers’ problems.” This kind of practicality is another key trait to evaluate in your job candidates, in order to avoid the pitfall of hires who are obsessed with finding the “perfect” solution. Often, as Voltaire observed, “Perfect is the enemy of good.” Many leaders of analytical teams struggle with data scientists who haven’t yet learned this lesson. Beating a good model to death for that last bit of lift leads to diminishing returns, something few organizations can afford in an ever-more competitive environment. As an executive customer recently commented during the SAS Analytics Customer Advisory Board meeting, there is an “ongoing imperative to speed up that leads to a bias toward action over analysis. 80% is good enough.”

9. Think about what kind of data scientist you need

Ken Sanford, PhD in Economics, University of Kentucky, speaking about how economists make great data scientists at the 2014 National Association of Business Economists Annual Meeting. (Photo courtesy of NABE)

Ken Sanford describes himself as a talking geek, because he likes public speaking. And he's good at it. But not all data scientists share his passion and talent for communication. This preference may or may not matter, depending on the requirements of the role. As this Harvard Business Review blog post points out, the output of some data scientists will be to other data scientists or to machines. If that is the case, you may not care if the data scientist you hire can speak well or explain technical concepts to business people. In a large organization or one with a deep specialization, you may just need a machine learning geek and not a talking one! But many organizations don’t have that luxury. They need their data scientists to be able to communicate their results to broader audiences. If this latter scenario sounds like your world, then look for someone with at least the interest and aptitude, if not yet fully developed, to explain technical concepts to non-technical audiences. Training and experience can work wonders to polish the skills of someone with the raw talent to communicate, but don’t assume that all your hires must have this skill.

10. Don’t expect your unicorns to grow their horns overnight

Annelies Tjetjep, M.Sc., Mathematical Statistics and Probability from the University of Sydney, eating frozen yogurt.

Annie Tjetjep relates development for data scientists to frozen yogurt, an analogy that illustrates how she shines as a quirky and creative thinker, in addition to working as an analytical consultant for SAS Australia. She regularly encounters customers looking for data scientists who have only chosen the title, without additional definition. She explains: “…potential employers who abide by the standard definitions of what a ‘data scientist’ is (basically equality on all dimensions) usually go into extended recruitment periods and almost always end up somewhat disappointed - whether immediately because they have to compromise on their vision or later on because they find the recruit to not be a good team player….We always talk in dimensions and checklists but has anyone thought of it as a cycle? Everyone enters the cycle at one dimension that they're innately strongest or trained for and further develop skills of the other dimensions as they progress through the cycle - like frozen yoghurt swirling and building in a cup.... Maybe this story sounds familiar... An educated statistician who picks up the programming then creativity (which I call confidence), which improves modelling, then business that then improves modelling and creativity, then communication that then improves modelling, creativity, business and programming, but then chooses to focus on communication, business, programming and/or modelling - none of which can be done credibly in Analytics without having the other dimensions. The strengths in the dimensions were never equally strong at any given time except when they knew nothing or a bit of everything - neither option being very effective - who would want one layer of froyo? People evolve unequally and it takes time to develop all skills and even once you develop them you may choose not to actively retain all of them.”

So perhaps you hire someone with their first layer of froyo in place and expect them to add layers over time. In other words, don't expect your data scientists to grow their unicorn horns overnight. You can build a great team if they have time to develop as Annie describes, but it is all about having appropriate expectations from the beginning.

To learn more, check out this series from SAS on data scientists, where you can read Patrick Hall's post on the importance of keeping the science in data science, interviews with data scientists, and more.

And if you want to check out what a talking geek sounds like, Ken will be speaking at a National Association of Business Economists event next week in Boston - Big Data Analytics at Work: New Tools for Corporate and Industry Economics. He'll share the stage with another talking geek, Patrick Hall, a SAS unicorn I wrote about in my first post.

Is machine learning trending with economists?

I am noticing a trend. At the ASSA meetings in January (where economics, sociology and finance academics and practitioners gather to discuss their research) I was surprised to see how much “machine learning” was trending with economists. The session “Machine Learning Methods in Economics and Econometrics,” with papers by Susan Athey (Microsoft and Stanford) and Pat Bajari (Amazon and University of Washington), was one of the most popular at the conference. Both authors are joining a group of economists that are cautiously dipping their toes into the area of predictive modeling known as machine learning (ML). While ML tools have been becoming increasingly popular with computer scientists, only recently have economists embraced the value of some of these methods.

The second piece of evidence suggesting that machine learning is trending with economists came from a conference I recently attended at the National Academy of Sciences called, “Drawing Causal Inference from Big Data”. The more than 400 attendees from academia, government and industry heard papers from top academics working to merge the field of statistical inference or causality with the tools typically used with “big data.” Here were my highlights from the conference, with recordings of the talks included as links.

  • Michael Jordan talked on the intersection of statistical computing and inference. He reviewed the literature on inference under constraints. My favorite part was the concept of “Bag of Little Bootstraps,” which is a method to assess the quality of estimators from a large dataset.
  • Judea Pearl talked on the importance of the causality story with “big data.” For example, “subjects of the big data system (patients for instance) will attempt to pull causality from the users of big data (doctors)”…. So correlation won’t be enough for long. He spent a lot of time talking about the problems of observational data and causality (with a lot of reliance on DAGs).
  • Thomas Richardson talked about using observational pharmaceutical data to inform efficacy.
  • David Heckerman from Microsoft talked about personalized medicine based on genetic information. He spent some time explaining his work, which he calls FaST-LMM.
  • Bernhard Schölkopf explained how causal models can help machine learning and gave an overview of his research in this area.
  • Susan Athey’s talk was about using trees to improve causal inference. This talk was my favorite, because it was an overview of many different ML methods that can assist an economist in model specification, with plenty on cross-validation and heterogeneous treatment effects as well.
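
For readers who have not met the “Bag of Little Bootstraps” idea mentioned above, here is a minimal Python sketch of the general approach as I understand it: draw a few small subsets, resample each one back up to the full data size, measure the spread of the estimator within each subset, and average those spreads. The function name, the parameter defaults and the synthetic data are my own assumptions for illustration; they are not taken from the talk.

```python
import random
import statistics


def blb_standard_error(data, estimator=statistics.fmean, gamma=0.6,
                       n_subsets=5, n_resamples=40, seed=0):
    """Rough Bag of Little Bootstraps estimate of an estimator's standard error.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))          # size of each "little" subset
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)     # drawn without replacement
        estimates = []
        for _ in range(n_resamples):
            # Resample n points (the full data size) from only b distinct values.
            resample = rng.choices(subset, k=n)
            estimates.append(estimator(resample))
        per_subset.append(statistics.stdev(estimates))
    # Average the per-subset quality measures.
    return statistics.fmean(per_subset)


# Synthetic data for illustration only: 2000 draws from a standard normal.
_rng = random.Random(42)
data = [_rng.gauss(0.0, 1.0) for _ in range(2000)]
se = blb_standard_error(data)
```

For the sample mean of standard normal data, the result should land near 1/sqrt(n). A production version would typically use multinomial weights instead of materializing each resample, which is where the savings on large data come from.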

The third piece of evidence for this trend is a shameless plug for an upcoming talk I am giving. At an event hosted by NABE, Big Data Analytics at Work: New Tools for Corporate and Industry Economics, Patrick Hall, Chief Machine Learning Scientist at SAS, and I will give economists an introduction to the methods and technology needed to get started with ML methods. We plan to discuss many of the methods that can be used to glean insight from large data sets, whether they are long or wide. There are very few seats left for the conference, which is June 16-17, so if you haven’t already done so, sign up now! It will be an excellent introduction to both methods and potential applications. I will make sure to post our slides after the conference. Let’s see if the trend toward machine learning in economics continues!

Numeric validation for analytical software testing

There is a job category unfamiliar to most people that plays a crucial role in the creation of analytics software. Most can surmise that SAS hires software developers with backgrounds in statistics, econometrics, forecasting or operations research to create our analytical software; however, most do not realize there is another group of people who work closely with individual developers to test their code. For the analytics products at SAS these people are called analytics testers. What do they do?

At SAS, verifying the correctness of procedural output is termed “numeric validation.” This process consists of independently checking and verifying all the numeric output created by a developer in a SAS procedure or function. Just as SAS has invested in a large stable of talented developers with advanced degrees in specialized areas of statistics, econometrics, forecasting, operations research and mathematics, SAS has also invested in an equivalent stable of analytics testers with advanced degrees in the same specialty areas. One primary responsibility of an analytics tester is to ensure numeric correctness independent of the developer, which they typically do by replicating the method in alternate code. Think of dueling PhDs racing to implement the same algorithm but in different ways. Numeric validation requires that their results agree, because agreement provides greater assurance that the implementation and output of the algorithm are correct.
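
To make the idea of independent replication concrete, here is a small Python sketch. It is not how SAS does numeric validation (that work is done in C and SAS/IML, as described in this post); it simply computes the same statistic, a sample variance, by two different algorithms and checks that the results agree within a tolerance. The function names and the tolerance are assumptions chosen for illustration.

```python
import math


def variance_two_pass(xs):
    """'Developer' version: compute the mean first, then squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_welford(xs):
    """'Tester' version: Welford's one-pass update, written independently."""
    mean = 0.0
    m2 = 0.0
    count = 0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / (count - 1)


def numerically_agree(a, b, rel_tol=1e-9):
    """Two results 'agree' if they match within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
agree = numerically_agree(variance_two_pass(sample), variance_welford(sample))
```

Agreement does not prove both are right, but a disagreement is a reliable signal that at least one implementation, or one reading of the source material, needs a second look.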

Developer Ying So on the left and analytical tester Yu Liang on the right, who "dueled" over the survival analysis work described in this post

But what happens when they do not agree? This happens quite often during the software development process. As you might imagine, this leads to a lot of head-scratching and white-boarding to figure out the disparity. When the numbers differ, both the developer’s and the tester’s results are called into question. Who is correct? Did the developer implement the statistical algorithm in the C language the same way the analytics tester did in SAS/IML, or vice versa? Was there a different interpretation of the algorithm from the source material, such as a journal article? Resolving the discrepancy can take time, because many algorithms and mathematical approaches are subtle, call for abstract interpretation, and take years of training to understand. Codifying those subtleties and abstractions into closed-form, robust and well-tested C code that produces the correct values for SAS customers can be quite arduous, and so is numeric validation of the results using an independent pathway.

Let me give you a concrete example. One of the analytics testers for SAS/STAT, who is responsible for testing and validating a heavily used SAS procedure used in drug trials, identified an issue with the cumulative incidence function (CIF) that was part of a new feature under development. Her numbers did not match output from the procedure the developer created. She notified the developer, and thus began a three-week exercise analyzing why their numbers differed. The tester had to write a 900-line SAS/IML program to independently calculate the CIF, because there were no other independent means to validate it. After much discussion, the developer determined that the analytics tester’s approach was technically correct and adjusted his C code for the procedure accordingly.

On the surface one might think this is just two statisticians arguing over a seemingly arcane issue, but the computation is critical in a field of statistics referred to as survival analysis. Biostatisticians and medical researchers use survival analysis to determine which factors increase the probability of survival for subjects in medical studies. Life-altering decisions are made based on the results of this analysis, so it is not hyperbole to say that numeric validation can be a life and death matter.

SAS analytic software is used the world over to drive decisions in nearly every area of scientific research, business, and government policy setting. Ensuring SAS software is well-tested and numerically correct is key to the integrity of those decisions, and analytics testers are a core part of that process. A few years ago SAS CEO Jim Goodnight said, “SAS was, is and always will be a collection of people who put the needs of the customer first, produce quality software unmatched by any other, and thrive in an innovative workplace that serves as a model across the globe.” This week at SAS is the second annual Quality Week, which is dedicated to raising awareness of the importance of software quality and empowering employees to help solve quality challenges. As SAS employees shine attention on quality this week, I am proud to lead a team with a particular contribution to quality related to numeric validation.

 

Easter forecasting challenges the Easter Bunny brings

The date of Easter influences our leisure activities

Unlike many other public holidays, Easter is a so-called movable holiday. This means that the Easter bunny brings more than just eggs for the statistician - he brings special Easter forecasting challenges. In the year 325 CE the Council of Nicaea determined that Easter would fall on the Sunday after the first full moon in spring. The earliest date for Easter is thus March 22nd, the latest April 25th. Carl Friedrich Gauss took this rule into consideration and developed the Gauss algorithm for Easter, which allows the determination of the date of Easter for a given year.
The moving date of Easter matters, because this holiday has a very strong influence on our leisure activities. Many people use the feast of Easter to take a spring holiday, which in turn has a strong impact on the hotel, restaurant, and transportation industries. Our choice of activities also depends on the timing of Easter. For many of my fellow Europeans, when Easter falls at the end of March, we may plan a ski vacation, while an April date may call for plans that involve enjoying the warming spring weather instead of seeking the snow.
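
If you want to compute the date yourself, the short Python sketch below may help. Note that it is not Gauss's original formulation: it uses the widely published "anonymous Gregorian" variant (often credited to Meeus, Jones and Butcher), which gives the date of Western Easter in the Gregorian calendar. The function name is my own.

```python
from datetime import date


def easter_date(year):
    """Western (Gregorian) Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19                        # position in the 19-year lunar cycle
    b, c = divmod(year, 100)             # century and year within the century
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30   # offset to the paschal full moon
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7  # offset to the following Sunday
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


easter_2016 = easter_date(2016)  # March 27, 2016
easter_2017 = easter_date(2017)  # April 16, 2017
```

A helper like this makes it easy to flag, for any year in your history or forecast horizon, whether the Easter peak belongs to March or April.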

Analysis of the data mirrors that behavior

Time series data on a monthly basis reflects this behavior. In a multi-year observation you often see that the peaks switch between March and April.

  • When Easter falls in March, winter sports resorts in the Austrian Alps often have full hotels in that month. In April, however, occupancy drops very sharply, because Easter often also means the end of the season for the ski lifts.
  • The effect of the importance of the Easter holidays can also be observed by the number of air and rail travelers or visitors to recreational facilities.

The following graph shows the number of visitors at a leisure park during the period from 1993 to 2000. To make the pattern easier to recognize, only the months March (blue) and April (red) are shown. In addition, a green step curve shows whether Easter falls in March or April in each year.

It is easy to see that for a holiday that starts in March, the blue line is higher than the red line. For a holiday that starts in April, the reverse is true. The behavior of consumers and tourists shows up clearly in our analysis of the data.

It is important to take this fact into account when evaluating results. From a sales perspective it does not make sense to panic when March sales fall behind those of the previous year if the "Easter Peak" is expected only in April.

Easter forecasting challenges

A peak that varies between March and April can be a problem for time series forecasting models. Many simple time series models only use the historical values of the time series. If Easter falls in April in two consecutive years, the models expect the next peak again in April, because the model has only learned the seasonal effect, leading to these Easter forecasting challenges.
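
The effect described above is easy to reproduce. The Python sketch below compares a seasonal naive forecast (same month, previous year) with a forecast that treats Easter as an event. The Easter months are real calendar facts, but the visitor counts are made up purely for illustration, and both forecasting functions are simplified stand-ins that I wrote for this example, not the models used in any SAS product.

```python
# Month in which Easter Sunday fell (3 = March, 4 = April); calendar facts.
EASTER_MONTH = {2013: 3, 2014: 4, 2015: 4, 2016: 3}


def synthetic_visitors(year, month):
    """Made-up data: a baseline of 100 plus 50 in the month containing Easter."""
    return 100 + (50 if EASTER_MONTH[year] == month else 0)


history = {(y, m): synthetic_visitors(y, m)
           for y in (2013, 2014, 2015) for m in (3, 4)}


def seasonal_naive(history, year, month):
    """Forecast = the value of the same month one year earlier."""
    return history[(year - 1, month)]


def event_adjusted(history, year, month):
    """Estimate baseline and Easter uplift separately, then recombine."""
    easter_vals = [v for (y, m), v in history.items() if EASTER_MONTH[y] == m]
    other_vals = [v for (y, m), v in history.items() if EASTER_MONTH[y] != m]
    baseline = sum(other_vals) / len(other_vals)
    uplift = sum(easter_vals) / len(easter_vals) - baseline
    return baseline + (uplift if EASTER_MONTH[year] == month else 0)


naive_2016 = {m: seasonal_naive(history, 2016, m) for m in (3, 4)}
event_2016 = {m: event_adjusted(history, 2016, m) for m in (3, 4)}
```

With Easter in April in 2014 and 2015, the seasonal naive model puts the 2016 peak in April again, while the event-aware version moves it to March, where Easter actually falls in 2016.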

This fact is important because there is an increasing trend for planning software packages to offer only a few simple standard methods of time series forecasting. For some applications a simple seasonal exponential smoothing model does quite well. But it takes a little more effort to handle situations like the ones above. You do not want to manually correct your budget figures every time for the “Easter effect.” It is tedious to always explain your own numbers in a planning meeting with the caution, "Yes, but this year...."

Events, inputs and flexible time intervals make a difference

The forecasting solution from SAS is built to consider these points for you:

  • Automatic model selection checks for you whether the time series can be forecast with a simple model or whether a more powerful model should be used.
  • Your individual data are analyzed to select the most appropriate model from a model library and fit it to your data.
  • The forecast considers input variables like the number of workdays, Saturdays, or Sundays, as well as already-booked orders.
  • Pre-defined calendar events like Easter or Christmas can be considered in the analysis.
  • You can define individual events that are important for your company or organization that influence the course of your key performance indicators.
  • You have the possibility of adapting the length of time intervals to the seasonal pattern of your processes.

Next year, Easter takes place on March 27th; in 2017 it will fall on April 16th. Check now to see whether your forecasting software and the resulting forecast numbers consider this shift.

SAS Global Forum, another moveable date

Another event that has a moving date is SAS Global Forum. This year it takes place April 26th-29th in Dallas, Texas.
My paper, “Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows You How,” is scheduled for Wednesday, April 29th, 10:00-10:50 a.m. (Preview the A2014 Demo Version). I look forward to meeting you there!

 

The unicorn quest continues – more tips on finding data scientists (part 3)

Because finding analytical talent continues to be a challenge for most, here I offer tips 5, 6, and 7 of my ten tips for finding data scientists, based on best practices at SAS and illustrated with some of our own “unicorns.” You can read my first blog post for why they are called unicorns and for tips 1 and 2 on finding them in an MS in Analytics Program or from a great program you’ve never heard of. You can read tips 3 and 4 on how to find this kind of talent outside the traditional academic disciplines. Today’s post focuses on interns, social networks, and sponsorship for permanent residency.

5. Try before you buy – create an intern program

Golbarg Tutunchi (ABD* in Industrial Engineering/Operations Research, NC State University), Fatemeh Sayyady (PhD in Operations Research, NC State University), Zohreh Asgharzadeh (PhD in Operations Research, NC State University), and Shahrzad Azizzadeh (ABD* in Operations Research, NC State University).

Intern programs are a great way for both the employer and the student to find out if the candidate is a good fit at the organization, so SAS hires close to 200 students each year. In addition to the standard internship programs that we run, we have two special programs that are particularly useful in our quest to find analytical talent. The Graduate Research Assistant (GRA) program places PhD students from local universities into Advanced Analytics R&D for up to 20 hours/week during the year and full-time during the summer (we had 14 in 2014). These students work on research in SAS related to their graduate training, and we maintain a partnership with their academic advisor. SAS provides funding to the academic departments, who in turn provide a stipend, tuition, and benefits for the students. Academics benefit from exposure to problems in practice, SAS benefits from academic insight into our research areas, and the students receive funding but also exposure to practice. R&D also has technical student positions that function like regular internships. R&D hires most of these students when they finish, including the women above who are part of the growing Iranian diaspora at SAS—Fatemeh (who works on SAS® Marketing Optimization), Zohreh (who leads a team of operations research specialists who develop solutions for retail and manufacturing), Shahrzad (who tests our mixed integer programming solver and decomposition algorithm in SAS/OR®), and Golbarg (a current GRA student who works on our Advanced Analytics and Optimization Services team, which provides specialized consulting on optimization and simulation problems).

Advanced Analytics R&D also has an Analytical Summer Fellows program, which hires PhD students for a special summer internship to expose them to the world of commercial statistical software research and development and provide an opportunity to explore software development as a career choice. In addition to a standard summer internship salary, the students receive a stipend, which allows us to recruit students from around the US. The program began with one student in statistics in 2006, and thanks to our CEO Jim Goodnight’s support and enthusiasm for programs of this kind, we have now expanded it to include positions for students in statistics, data mining, econometrics, operations research, and text analytics.

6. Use social networks to hire the friends of your unicorns

Left circle, Zohreh Asgharzadeh, and right circle, Shahrzad Azizzadeh, at Sangan Waterfall, northwest of Tehran, with classmates from their Sharif University of Technology Industrial Engineering class.

We know the power of social networks from an analytical perspective, but have you thought about using them in your recruiting? At SAS, 55% of our hires come from an employee referral. After all, while merit is critical in choosing the best candidate to hire for a given role, having a “spotter” help you fill the pipeline can help immensely, since the pool of unicorns is small in the first place. And who better to know where to hunt for unicorns than a member of their own species? Our employees are eager to recruit people they know, because they consistently tell us they want to work with smart people.

Remember Zohreh and Shahrzad, two of our former GRA students? They have been close friends since high school in Iran. Zohreh first came to the US for graduate school at NC State University, to pursue her PhD in Operations Research. Shahrzad was still in Iran, first pursuing an MS in Industrial Engineering and then working in industry. Hearing about her friend’s positive experiences in graduate school in the US, Shahrzad decided to apply to a US graduate school herself. Zohreh worked as a GRA student at SAS on retail optimization solutions, and because she enjoyed her work opted to work at SAS full-time upon graduation. Later, when Shahrzad saw an advertisement for a GRA placement at SAS, she asked her friend Zohreh about it. Zohreh encouraged her to apply, and the rest is history. And Fatemeh (from the photo above) applied for a position in part because she had a positive impression of SAS conveyed by her friends Shahrzad and Zohreh.

These two long-time friends have quite the academic pedigree, since their math tutor in high school was Maryam Mirzakhani, who last year won the Fields Medal, widely considered the Nobel Prize of math. As the first woman in its 78-year history to have won, she is a trailblazer. These women challenge the tradition of the Western world, where far fewer women choose the STEM disciplines, because they were encouraged to pursue their interest in math. Curious about that difference, I found an interesting article that provides a cross-cultural analysis of students with exceptional math talent and provides some recommendations. Its conclusion is sobering: "In summary, some Eastern European and Asian countries frequently produce girls with profound ability in mathematical problem solving; most other countries, including the USA, do not." So while efforts are being made to close that gap, perhaps an additional tip should be that if you want to hire more female unicorns, seek out the Iranians!

7. Be willing to invest in sponsorship for foreign nationals

Bahadir Aral (PhD in Industrial and Systems Engineering from Texas A&M University), in front of the Süleymaniye Mosque in Istanbul.

"In today's global economy, innovation is the key to sustained growth and success. At SAS, we have long recognized this fact; it is why we are so committed to attracting and retaining the best and brightest minds from across the globe," wrote Jim Goodnight in this opinion piece on immigration reform. Part of his interest in immigration derives from the fact that economic forecasts project that American universities will produce 1 million fewer graduates from the science, technology, engineering and math disciplines (STEM) than the US workforce needs. And he points out that American universities are not currently able to meet that need with US-born students. In the US we are fortunate that top students from around the world are drawn to our excellent higher education system, so American companies can recruit top global talent domestically. An obvious repercussion is that some of these graduates will need legal sponsorship for the papers to stay in your country. As our CEO wrote, at SAS we feel this can be an investment with high return - the fees are not significant and the talent available is high.

For example, Bahadir is a pricing expert and consultant in our Advanced Analytics and Optimization Services group in R&D. Even though he came to SAS with almost five years of professional experience, he still needed to complete paperwork for a green card to be able to work in the United States. When his work visa ran out he had to return to his native Turkey and work from our Istanbul office for 6.5 months while the next step in his immigration application was being processed. Bahadir is so dedicated that he maintained his customer meetings in spite of the seven hour time zone difference, which meant he was sometimes scheduled for telephone calls that approached the midnight hour in Turkey! Another pricing expert on his team, Natalia, is Russian by birth but as a child moved to Mexico, where she obtained her first PhD in industrial engineering at Tecnológico de Monterrey. She then moved to the US to pursue a second PhD, this time in Operations Research, from North Carolina State University. She arrived at SAS with dual Russian and Mexican citizenship but didn’t yet have permanent residency, so SAS is also sponsoring her application.

Natalia Viktorovna Summerville in front of St. Basil’s Cathedral on Red Square in Moscow.

Last three tips next time:

8. Hire the curious who want to solve problems
9. Think about what kind of data scientist you need
10. Don’t expect unicorns to grow horns overnight

What makes the smart grid “smart”? Analytics, of course!

You’ve heard about the smart grid, but what is it that makes the grid smart? I’ve been working on a project with Duke Energy and NC State University doing time-series analysis on data from Phasor Measurement Units (PMUs) that illustrates the intelligence in the grid as well as an interesting application of analytical techniques. I presented some of our findings at the North American SynchroPhasor Initiative (NASPI) workgroup meeting in Houston recently, so I thought I’d share them with a broader audience.

PMUs in the power grid

Phasor Measurement Units (PMUs) take measurements of the power transmission grid at a much higher speed and fidelity than previous systems provided. PMUs measure the power frequency (i.e., 60 Hz), voltage, current, and phasor angle (i.e., where you are on the power sine wave). These units take readings at a speed of 30 measurements/second, while the previous systems just took readings every 3-4 seconds. This more frequent interval provides a much more detailed view of the power grid and allows detection of sub-second changes that were completely missed before.

Another great feature of the PMUs is their very accurate time measurement. PMUs are installed at points along the power grid miles apart from each other. For example, Duke Energy has over 100 PMUs installed across the Carolinas. To analyze data and learn about the whole grid, we need to synchronize the measurements taken at these locations. PMUs have Global Positioning System (GPS) receivers built in, not to determine their location, but so that all of them get the same accurate time signal. Since GPS provides time accuracy in the nanosecond range, this is more than sufficient for measurements taken 30 times per second. This accuracy is most critical in the measurement of phasor angles. By comparing the phasor angles between locations, we get a measure of the power flow between them. Since the measurements are of something changing at a frequency of 60 Hz, the time stamp of each measurement must be of significantly higher precision than what is being measured.
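To see why timing precision matters so much for phasor angles, here is a small worked calculation. It is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes an ideal 60 Hz sine wave, where a timestamp error of dt seconds shifts the apparent angle by 360 × 60 × dt degrees.

```python
def phase_error_degrees(timing_error_s: float, grid_hz: float = 60.0) -> float:
    """Apparent phase-angle error (degrees) caused by a timestamp error.

    One full cycle is 360 degrees and lasts 1/grid_hz seconds, so a clock
    that is off by timing_error_s misplaces the angle by this amount.
    """
    return 360.0 * grid_hz * timing_error_s


# Hypothetical clock errors, from coarse to GPS-like precision.
for label, dt in [("1 millisecond", 1e-3),
                  ("1 microsecond", 1e-6),
                  ("100 nanoseconds", 1e-7)]:
    print(f"{label:>16}: {phase_error_degrees(dt):.5f} degrees")
```

A clock that is off by a millisecond would misplace the angle by more than 20 degrees, which would swamp the angle differences between locations, while a sub-microsecond clock keeps the error to a small fraction of a degree.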

Working with this data has highlighted the similarities and differences between working with big data and high-speed streaming data. Big data is typically a large amount of data that has been captured and stored for analysis. Streaming data is constantly coming in at a high rate and must be analyzed as it is being received. One of the many interesting things about this project is that it involves both big data and streaming data.

So what have we learned working with this data? The main purpose of this project is to detect and understand events that are affecting the power grid, with the objective of keeping the grid stable. We have learned there are a number of time-series techniques that are needed for the different aspects of providing the needed answers. The analysis flow breaks down into three areas: event detection (did something happen?), event identification (what happened?), and event quantification (how bad was it?).

For event detection, the task at hand is streaming data analysis. The system generates 30 measurements per second on hundreds of sensors and tags. Fortunately, the vast majority of the time (>99.99%) they indicate that no event of any kind is occurring. Since there are time-series patterns present, they can be modeled and used to detect when there is a deviation from the normal pattern. Determining these models allows us to look forward with a very short-term forecast and then instantly detect an event of interest.
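The idea of comparing each new reading against a very short-term forecast can be sketched in a few lines. This is only an illustration, not the models used in the project: it assumes a simple exponentially weighted moving average as the forecast, and the function name, smoothing factor, threshold, and simulated frequency readings are all hypothetical.

```python
import math


def detect_events(samples, alpha=0.2, threshold=4.0, warmup=30):
    """Return the indices of samples that deviate from a one-step forecast.

    The forecast is an exponentially weighted moving average (EWMA), and the
    typical residual size is tracked with an EWMA of squared residuals.
    Both are stand-ins for whatever time-series model fits the real data.
    """
    events = []
    forecast = samples[0]
    variance = 0.0
    for i, x in enumerate(samples[1:], start=1):
        residual = x - forecast
        std = math.sqrt(variance)
        if i > warmup and std > 0 and abs(residual) > threshold * std:
            # Flag the sample and leave the model untouched so the event
            # does not contaminate the picture of "normal" behaviour.
            events.append(i)
        else:
            forecast += alpha * residual
            variance = (1 - alpha) * variance + alpha * residual ** 2
    return events


# Simulated frequency readings near 60 Hz with a slow ripple and one dip.
readings = [60 + 0.001 * math.sin(i / 5) for i in range(300)]
readings[200] -= 0.05
print(detect_events(readings))
```

Because normal samples only update two running numbers, the check costs almost nothing per reading, which is what makes this kind of approach practical on a high-rate stream.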

Event identification is the next order of business. An event of interest doesn’t necessarily mean there is a problem or that one will develop. Some events are random, like a lightning strike or a tree hitting a power line. Others represent some type of equipment failure. We’ve determined that many of these events produce a similar ‘signature’ in the data stream, because time-series similarity analysis and time-series clustering have been able to match the incoming events to previously seen events. Knowing which previous event signatures are non-consequential allows us to safely ignore them.
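Matching an incoming event against previously seen signatures can be illustrated with a nearest-neighbor lookup. This is a minimal sketch under stated assumptions: the signature library, its labels, and the sample values are invented, windows are assumed to have equal length, and z-normalized Euclidean distance stands in for the similarity analysis and clustering described above.

```python
import math


def znormalize(window):
    """Rescale a window to zero mean and unit variance so shape is compared."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def shape_distance(a, b):
    """Euclidean distance between two equal-length, z-normalized windows."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(znormalize(a), znormalize(b))))


def identify(event, library):
    """Return (label, distance) for the closest known signature."""
    return min(((label, shape_distance(event, signature))
                for label, signature in library.items()),
               key=lambda pair: pair[1])


# Hypothetical signature library; the shapes and labels are illustrative only.
library = {
    "step_drop": [0, 0, 0, -1, -1, -1, -1, -1],
    "spike": [0, 0, 0, -1, 0, 0, 0, 0],
    "ringing": [0, 1, -0.8, 0.6, -0.4, 0.3, -0.2, 0.1],
}
incoming = [60.0, 60.0, 60.01, 59.9, 59.91, 59.9, 59.89, 59.9]
print(identify(incoming, library))
```

Normalizing first means a small step and a large step with the same shape match the same signature, leaving magnitude to be judged separately.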

Finally we look at event quantification. For some events, the question is not just whether the event is occurring but also whether its magnitude gives cause for concern. An example is oscillation on the power grid. Small, diminishing oscillations are not necessarily a problem, but larger ones that are increasing may require further attention. Once the event type is identified, specialized techniques for that type are used to determine its magnitude and consequence.

This project has provided interesting insights into how to make the power grid smarter. Many of these techniques are also beneficial to streaming data analysis seen in other industries and applications. If there is a need to automatically identify and categorize system events based on data patterns, or filter out events that are non-consequential, then these techniques will be helpful.

Photo credits

PMUs in the power grid: Synchrophasor Technologies and their Deployment in the Recovery Act Smart Grid Programs, August 2013 report by the US Department of Energy

Firefighter image credit: photo by US Navy // attribution by creative commons

Hoover Dam image credit: photo by IAmSanjeevan // attribution by creative commons

Are there jobs for economists in analytics?

SAS will again be participating in the Allied Social Science Associations (ASSA) annual meetings in January. This year the event will be held in Boston, and conference organizers expect more than 12,000 participants from a variety of backgrounds, including economics, finance and many other social sciences. One of the primary functions of the event, aside from traditional academic sessions, is to serve as a single meeting place for employers and job candidates. Each year, approximately 1,000 candidates attend ASSA for the sole purpose of finding a job. As I’ve written before, corporate economists are hot again and a great source for analytical talent, so if you’re on the job market consider exploring a career in industry. It’s a great place for economists to look for a job.

I’ve worked in academia and industry, so I know both worlds. And in my current role I constantly talk with economists performing analytical functions in some of the world’s largest companies. What I have seen is that both worlds provide ample opportunity to utilize your economist skills. In fact, I would conclude that a corporate analytic role will challenge your data skills in a way unlike academia. You will be pushed to learn both methods and as an economist you will be uniquely positioned to explain results to colleagues. There is, of course, no ‘free lunch,’ so your research will be guided by the firm’s revenue maximization or cost minimization priorities.

As you may know, SAS is one of the analytical computing platforms preferred by both businesses and government. In fact, Monster.com ranked it #1 on their list of “Job Skills that Lead to Bigger Paychecks.” For these reasons I was curious enough to conduct a little research about jobs using SAS. I chose to search the JOE website (Job Openings for Economists), the primary listing source for jobs in economics. I limited my search to positions listed as “Full-time nonacademic,” as universities are not likely to prefer one computing language over another. Of the 247 nonacademic listings, here is what I found for each program.

Search Term    Number of jobs and link
SAS            41
Stata          34
Matlab         27
R              16
Python         11
SPSS           5
SQL*           9

*While SQL is not a computational program, I would consider it a language.

SAS was explicitly listed in 41 job descriptions. Each number above is a hyperlink to the actual search, should you be interested in seeing the jobs available.

For those attending the ASSA conference, please stop by the SAS booth in the exhibit hall and say hello. And if you want to talk about the market for economists in industry or would like to meet some of this year’s job market candidates or industry representatives, please consider attending the SAS Academic and Industry Reception to be held on Sunday, January 4th at 5pm. If you are not attending ASSA but will be in the area on that date, we’d be glad to have you join us at this reception. Please click here to RSVP.

Don’t be fooled: there are really only two basic types of distributed processing

Every time I pick up a new article about analytics, I am always disappointed by the fact that I cannot find any specifics mentioned about back-end processing. It is no secret that every vendor wishes they had the latest and greatest parallel processing capabilities, but the truth is that many software vendors are still bound by single-threaded processing – as indicated by their obvious reticence about discussing details on the subject. As a result of using older approaches to data processing, most competitors will toss around terms like ‘in-memory’ and ‘distributed processing’ to sow confusion about how their stuff actually works. I will explain the difference in this post and tell you why you should care.

The truth is that there are really only two basic types of distributed processing: multi-processing (essentially grid-enabled networks) and pooled-memory massively parallel processing (MPP). Multi-processing essentially consists of duplicate sets of instructions being sent to an array of interconnected processing nodes. In this scenario, each node has its own allocation of CPU, RAM, and data, and generally does not have the ability to communicate or share information with other nodes in the same array. While a large multi-step job can be chopped up into pieces and each piece processed in parallel, the multi-processing configuration is still largely limited by the duplicate, single-threaded sets of instructions that need to run.

Contrast multi-processing with a pooled-memory architecture that has inter-node communication and does not require duplicate sets of instructions. Each node in a pooled-resource configuration can work on a different part of a problem, large or small. If any node needs more resources, data, or information from any of the other nodes, it can get what it needs by issuing messages to any of the other nodes. This makes for a truly ‘shared resources environment,’ and as a consequence it runs about ten times faster than the fastest multi-processing array configuration.

Now, much of the confusion about these two types of distributed processing exists because of misuse of the term ‘in-memory’. The fact is that ALL data processing occurs in-memory at some point in the execution of a set of code instructions. So ‘in-memory’ is really a misnomer for distributed processing. For example, traditional SAS processing has always occurred in-memory as blocks of data are read from disk into RAM. As RAM allocations have gotten larger, more data has been loaded into memory, yet the instructions were still processed using a single-threaded and sequential approach. What was needed was a rewrite of the software to enable multi-threading, namely routing separate tasks to different processors. Combining a multi-threaded program with data pre-loaded into memory produces phenomenally fast run times compared with what could be accomplished before.
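The pattern of routing separate tasks to different workers over data that is already in memory can be shown with a toy example. This is only a sketch of the pattern, with hypothetical function names and a made-up workload; it says nothing about how any vendor's software is implemented, and in standard CPython, threads illustrate the structure but will not by themselves speed up CPU-bound arithmetic like this.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split an in-memory list into roughly equal consecutive chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """The task each worker runs: a partial sum and a count."""
    return sum(chunk), len(chunk)


def parallel_mean(data, n_workers=4):
    """Compute the mean by combining partial results from several workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunked(data, n_workers)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


values = list(range(1, 1_000_001))  # data already loaded into memory
print(parallel_mean(values))
```

The single-threaded equivalent would walk the whole list in one pass; the multi-threaded version splits the same in-memory data into independent tasks whose small partial results are combined at the end.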

Even though a program is multi-threaded, there is still no guarantee that things will run faster. An obvious example is Mahout, an Apache project that relied on MapReduce to facilitate inter-node communication in a pooled-resource environment. MapReduce is notoriously slow, as nodes take a long time to load data into memory and must write inter-node communication requests to disk before other nodes can access the request. As a consequence of its lethargic response time, Mahout has largely been abandoned by most large business customers in favor of faster architectures.

Message Passing Interface (MPI) is a much faster communication protocol, because it runs in-memory and can handle the multiple passes over the data that are common in predictive analytics work. Currently there are only two MPI initiatives that offer true multi-threading, one based on Spark, an in-memory plug-in to Hadoop, and SAS’ High Performance Analytics. Spark development is still in its infancy, comparatively speaking, and it will likely be years before any push-button applications can make use of its capabilities. In contrast, SAS has products that are production-ready today and can dramatically shorten your analytics lifecycle. So, do not be fooled by claims of in-memory or distributed processing, because MPI-enabled pooled-memory processing is here to stay and looks set to become the de facto standard for all future predictive analytics processing.

For many standard analytics jobs, your standard architecture may be sufficient. But these phenomenally-fast run times matter when you are trying to process dozens, if not hundreds, of tournaments that consist of the most advanced machine learning techniques like random forests and deep-learning neural networks. Statistical professionals are finding that these new techniques are not only more accurate, but they also allow us to investigate much lower levels of granularity than ever before. As a result, models are getting more precise and profitability is increasing concomitantly. So if you want to solve more problems faster and with more accuracy (plus use the same headcount), be sure to investigate claims of “in-memory” and choose the right architecture for your job.

Econometric reflections from Analytics 2014

This post will violate the “what happens in Vegas stays in Vegas” rule, because last week I had the pleasure of attending and participating in the Analytics 2014 event there and want to share some of what I heard for those who couldn’t attend. I was joined by over 1,000 attendees and colleagues as we gathered to share best practices in the fields of statistics, econometrics, forecasting, text analytics, optimization, data mining, and more. My talk, demo theatre presentation, and exhibit hall hours gave me the opportunity to meet many interesting people, so here are some of my thoughts based on what I heard and learned.

  • Jan Chvosta and I presented “Why Econometrics Should Be in Your Analytics Toolkit: Applications of Causal Inference” (presentation available here) to approximately 100 attendees and sincerely appreciated the feedback we received. Of note to me was just how many audience members approached us afterward and said that “causal interpretation” is what they strive for with their predictive modeling. From marketing mix models to CCAR stress testing to price elasticity estimates, I saw many nodding heads when we talked about the importance of interpretation in these models. To twist the words of Nobel Laureate Robert Lucas, “once you start thinking about causality, it is hard to think about anything else.” It appears to me that there are still many people interested in the meaning of models in this world of “big data” and “machine learning.”
  • I was able to attend Michele Trovero’s talk on using SAS/ETS® tools to estimate linear and non-linear state-space models. Michele showed several examples and benefits of the existing SSM procedure and gave a look ahead to the upcoming ESMX procedure. Once released, this procedure will allow new exponential smoothing models to be estimated in SAS as well as provide a statistical treatment of structural time series models based on exponential smoothing. Use this link to find his slides or listen to this interview to hear Michele describe some of these tools.
  • At the econometrics booth I was asked about many different topics. We talked about state-space models (PROC SSM), time series data management with PROC TIMEDATA, and the use of multinomial discrete choice models for price elasticity measurement (PROC MDC). However, far and away the dominant topic was the new CCAR compliance regulations. For those unfamiliar with them, these regulations require bank stress tests to incorporate macroeconomic scenarios into the forecasts of financial performance, with the goal of ensuring bank solvency under adverse economic events. One of the difficulties is that the analysis is no longer strictly a predictive modeling problem. New policies dictate that certain variables must be included and that their effects should behave consistently with economic theory. No fewer than ten banks independently asked about modeling techniques to satisfy these regulations. It was quite interesting, because each bank had a different method of solving the problem. Some banks with access to micro data chose to model probabilities of default for each asset. Other banks without access to this data have opted for a time-series based approach. For some time now, I have been working on formalizing these methods, and I will present a paper on the subject both at the Conference on Statistical Practice in February and at SAS Global Forum in April.
  • It was my pleasure to lead a roundtable discussion about statistical and econometric modeling in health insurance. It was a packed table and I apologize if you were turned away. Many of the topics discussed during our causality talk were echoed during the roundtable, most notably non-random assignment of certain interventions. In fact, one large health insurer spoke about the early returns of the Affordable Care Act with respect to substitution effects, or lack thereof, from emergency room usage to traditional clinics. This may suggest that the population now covered by new rules remains unlikely to shift their healthcare usage from high-cost emergency room visits to less costly outpatient facilities.
  • Finally I would like to thank the 15 or so attendees at my 7:30am demo theatre presentation on the new items in SAS/ETS®. These were brave and dedicated souls to be such early risers in Vegas! I always enjoy the chance to evangelize our new tools. We spent a great deal of time talking about PROC HPCDM, a tool for simulating aggregate loss distributions in insurance and banking. People were interested because there is now a very computationally efficient way to simulate aggregate losses subject to business rules about deductibles and limits. We also talked about new methods of estimating limited dependent variable models in the presence of endogenous regressors. There was an interesting question about tools for spatial econometric models, which aren't part of the current portfolio but will definitely be part of future presentations.
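
To make the aggregate-loss idea in the last bullet concrete, here is a small simulation written in Python. It is a minimal sketch, not a description of how PROC HPCDM works: it assumes a Poisson number of losses per year and lognormal loss sizes, applies a per-loss deductible and limit, and every function name and parameter value is hypothetical.

```python
import math
import random


def insurer_payment(loss, deductible, limit):
    """Amount paid on one loss after applying a deductible and a limit."""
    return min(max(loss - deductible, 0.0), limit)


def poisson_draw(rng, mean):
    """Draw a Poisson count with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_aggregate_losses(n_years, mean_count, mu, sigma,
                              deductible, limit, seed=0):
    """Simulate the total paid per year for n_years independent years."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_years):
        n_losses = poisson_draw(rng, mean_count)
        totals.append(sum(
            insurer_payment(rng.lognormvariate(mu, sigma), deductible, limit)
            for _ in range(n_losses)))
    return totals


totals = sorted(simulate_aggregate_losses(
    n_years=10_000, mean_count=5, mu=8.0, sigma=1.0,
    deductible=1_000, limit=25_000))
print("mean annual payout:", sum(totals) / len(totals))
print("99th percentile:   ", totals[int(0.99 * len(totals))])
```

The business rules live entirely in `insurer_payment`, so changing the deductible or limit changes the whole simulated distribution without touching the frequency or severity assumptions.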

Ken Sanford being interviewed on the demo floor at Analytics 2014 (click on the photo for the interview)

Unicorn hunting: finding data scientists outside traditional academic disciplines

Finding people with the range of skills classified as data science can be a challenge, which is why some call them unicorns (do they really exist?), so I recently posted ten tips on finding unicorns. In my first post I elaborated on tips 1 and 2 (1. hire from an MS in Analytics program and 2. hire from a great program you've never heard of). In this post I'll share two more tips, which entail hunting for data scientists beyond the math, stats, computer science, operations research, and engineering departments where you might most expect to find this kind of talent.

3. Recruit from untraditional disciplines

Wayne Thompson, PhD in Plant Sciences from the University of Tennessee, during his year spent as a visiting scientist at the Institut superieur d'agriculture de Lille in France.

As this article from Inc. points out, computer science may not be the best place to find data scientists. In fact, the article refers to a survey of data scientists, 51% of whom say that the best source of data scientists is outside of computer science. For that matter, if you limit yourself to other, perhaps more “traditional,” analytical disciplines you may be overlooking some great candidates. One example is Wayne Thompson, the Chief Data Scientist at SAS, who studied plant sciences but minored in statistics. His path through agricultural sciences is natural for many of our colleagues, since SAS was founded by a consortium of land-grant universities heavily funded by grants from the United States Department of Agriculture. Over the years many of our senior executives have had degrees in agriculture-related disciplines like forestry, agricultural economics, etc.

Juthika Khargaria, Ph.D. in Astrophysical and Planetary Sciences from University of Colorado, standing next to the 18” telescope at Sommers Bausch Observatory.

Consider Juthika, an analytics solutions architect who assists customers in defining their business problems and demonstrates how SAS advanced analytics solutions could help. Before joining SAS, Juthika studied astrophysics, which covered complex systems, statistics, and how to deal with abstract concepts. Juthika says that the data astrophysicists deal with has high noise but low signal, so they are experienced in methods to tease out that signal. See how well Juthika bridges that gap in this blog post she wrote on using wavelets to separate the signal from the noise. Physicists also usually have strong computational skills, which is why we have hired several in Advanced Analytics R&D to develop our software.

4. Look beyond STEM departments to recruit from the social sciences

There are plenty of good reasons to recruit from the STEM (science, technology, engineering, and math) disciplines, since these fields provide their students an excellent foundation for analytical problem-solving. But there are good reasons to look to the social sciences as well. Phil Weiss is an analytical consultant who helps customers understand how our advanced analytics software might solve their business problems. He shared, “The value of a liberal arts degree cannot be understated when it comes to being able to more easily handle difficult conceptual problems and the multifarious nature of symbolic systems, especially programming languages….My statistics training and close association with ‘big data’ derived from depositional patterns allowed me to transition into computer science even though I had limited training in any STEM field.” In fact, as this article from Fast Company shows, many tech CEOs even prefer to hire from the liberal arts, arguing that these disciplines train students to “thrive in ambiguity and subjectivity,” which are hallmarks of any real business environment.

Phil Weiss (on left), ABD in Archaeology, Arizona State University on a dig while in grad school.

Most PhD programs in the social sciences require their students to take courses in the quantitative methods necessary to do data-driven research, so they may even have a more substantial foundation than you’d expect. The School of Social Welfare at the University of Wisconsin-Milwaukee even offers a Graduate Certificate in Applied Data Analysis Using SAS.  Significant research on statistical theory and quantitative methods is being done in colleges of education. There is an emerging field of computational journalism. And my colleague Ken Sanford has written and spoken extensively about why economists make great data scientists. So wander across campus to different buildings on your recruiting trips, and you may be delighted and surprised at what you find.

Next time:

5. Try before you buy - create an intern program

6. Sponsor foreign nationals