What if the Romans had analytics software? Analytics 2015 goes to Rome

SAS is hosting this year’s European Analytics 2015 conference in Rome, November 9 – 11. This inspiring three-day event will give you the chance to boost your company’s analytics culture in an international environment and make sure your knowledge and expertise meet the demands of the digital era. But what if we could travel back in time to ancient Rome and take today’s analytics software with us? Would SAS have been of any use to the Romans?

In the 6th century BCE the Roman king Servius Tullius instituted the census. The word "census" comes from the Latin censere ("to estimate"). In ancient Rome, every male Roman citizen in the empire had to register for the census every five years. There is even a reference to a Roman census in the Bible: the birth of Jesus occurred in Bethlehem because Mary and Joseph had traveled there to be enumerated. The census was used to count the citizens of the Roman Empire, to gauge the empire’s potential military strength for campaigns against other countries, and to assess taxes.

But imagine the extended use of the census if only the Romans could have benefited from today’s SAS analytics software….

SAS Forecast Server could have been used to plan the school system: forecasting the number of pupils likely to attend would have ensured that schools were built in the right locations with appropriate capacity.

The most efficient roads could have been designed with SAS/OR Software to optimize traffic and transportation.

With SAS Enterprise Miner the Romans could have discovered interesting patterns in the spread of diseases, grain harvests, and battles lost and won – and they might even have made some accurate predictions!

SAS Visual Analytics could have supported the senate in making decisions, for instance by visualizing characteristics of the population in the entire empire.

If you want to understand how your company can benefit from SAS Analytics software, then don’t hesitate to join us in Rome for Analytics 2015. I’m sure Servius Tullius would have been delighted to be there! I suspect he would be particularly interested in attending some presentations from the SAS Talks track, which will be delivered by employees in R&D and Product Management. A sample of the employees flying over from headquarters to present includes:

  • Explaining the Past and Modeling the Future: What's new in econometrics and time series – Ken Sanford, Principal Research Statistician, Advanced Analytics R&D, SAS

  • Exploiting Parallelization in Optimization – Manoj Chari, Senior R&D Director, Advanced Analytics R&D Division, SAS

  • Statistical Model Building for Large, Complex Data: Four New Directions in SAS/STAT® Software – Bob Rodriguez, Senior Director, Research & Development, SAS

SAS on GitHub

Right now I’m crossing the Pacific toward Australia and New Zealand for the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (a.k.a. KDD), a Data Science Melbourne MeetUp, and the SAS Users of New Zealand conference. New Zealand is the birthplace of open source R. So this trip has me thinking a lot about the relationship between open source and proprietary analytics tools. I’m expecting some tough questions from the local data science community, but one thing I’m really excited to share with them, and with you, is a set of new, SAS-sponsored, open source GitHub repositories (repos). Below you’ll find out why I’m excited about SAS’ participation in the open source analytics community, why I think GitHub is a great way to facilitate that participation, and more info about the repos themselves.

Whether you work alone, for a startup, or for an established company, participation in an open source project is about much more than just sharing code – among other things, it fosters intellectual ownership of the project and boosts your visibility. Proprietary analytics software is seldom mentioned on social media by people who are not on the payroll of said proprietary product. Yet my Twitter feed is practically choking on what amount to unsolicited love poems to open source analytics software. (That’s right, I have no life whatsoever.) Users of open source analytics tools feel a sense of pride and intellectual ownership toward their software, and this is incredibly important. Once a user feels ownership of a tool, of course they will blog, tweet, and defend it to the death in those “spirited” SAS vs. R vs. Python debates.

Why does representation on social media matter? Because perception can become reality. Analyst firms, practitioners, journalists, and even decision makers at large corporations use social media to know what is current. If social media discussions are dominated by proponents of open source tools, then people might begin to mentally relegate SAS to the legacy software dustbin. Social media matters even more for younger practitioners. They have to polish their online reputations just as older professionals polished their paper resumes. SAS provides excellent documentation, online examples, and social channels to communicate about our tools, but the new generation rightly clamors for a two-way channel of communication in which useful technical contributions can be shared easily and can boost an individual’s professional reputation. SAS is putting forward these new GitHub repos in the hope that user contributions will foster both intellectual ownership for users and public recognition of valuable contributions.

Just in case some of you don’t know about them yet, Git and GitHub combine into a sophisticated solution for version control, bug tracking, and collaboration between developers, plus some social goodies to help you follow your favorite developers and projects. Oh yeah, and they’re free. For long-time SAS users, GitHub has code snippets and snarky repartee like SAS-L, plus really nice development tools. GitHub even has syntax highlighting for SAS. The only thing missing from the party is you!

So without any further ado, the repos are:

github.com/sassoftware/enlighten-integration - Integration strategies between Java, PMML, Python, R, and SAS.

github.com/sassoftware/enlighten-apply - Applied machine learning projects.

github.com/sassoftware/enlighten-deep - Code for deep learning (as it becomes available).

Why the name enlighten for the SAS repos? For two reasons. First, it is a move of enlightened self-interest for SAS to put macro code on GitHub: SAS will benefit from user knowledge and contributions. Second, SAS wants to shed light on some of our own more advanced machine learning capabilities. SAS has been working in machine learning since the early 1980s and has a lot of production quality machine learning functionality to offer to our customers.
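
If you’d like to pull code from one of these repos straight into a SAS session, here is a minimal sketch using the URL filename access method. The file path below is hypothetical (browse the repo for the actual program names), and depending on your SAS release, HTTPS access may need additional configuration.

```sas
/* Hypothetical example: read a program directly from the enlighten-apply repo.
   The file name below is a placeholder - check the repo for real programs.   */
filename repo url
   "https://raw.githubusercontent.com/sassoftware/enlighten-apply/master/example_macro.sas";
%include repo;         /* compile the code defined in the downloaded file */
filename repo clear;   /* release the fileref when done                   */
```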

These repositories are a work in progress, and we really hope they continue to evolve with input from inside and outside of SAS. They have also been a team effort by me and several of my peer SAS Enterprise Miner developers, including Funda Gunes, Tim Haley, Radhikha Myneni, and Ruiwen Zhang. It’s been a really fun project for us, and we look forward to your pull requests. (Pull requests are one way to contribute to a GitHub repo. Just google it.)

Image credit: photo by othree // attribution by creative commons

Three final talent tips: how to hire data scientists

How do we hire data scientists at SAS, given that we are not unique in our search for a rare talent type that continues to be in high demand? This post is the last in a series on finding data scientists, based on best practices at SAS and illustrated with some of our own “unicorns.” You can read my first blog post for why I call them unicorns and for tips 1 and 2 on finding them in an MS in Analytics program or from a great program you may not have heard of. You can read tips 3 and 4 on how to find this kind of talent outside the traditional STEM academic disciplines. And tips 5, 6, and 7 detail the value we’ve found in intern programs, social networks, and sponsorship of foreign nationals.

This last post focuses on less tangible aspects, related to curiosity, clarity about what kind of data scientist you need, and having appropriate expectations when you hire.

8. Look for people with curiosity and a desire to solve problems

Radhika Kulkarni, PhD in Operations Research, Cornell University, teaching calculus as a grad student.

As I blogged previously, Greta Roberts of Talent Analytics will tell you that the top traits to look for when hiring analytical talent are curiosity, creativity, and discipline, based on a study her organization did of data scientists. It is important to discover whether your candidates have these traits, because they separate candidates who will find practical solutions from those who may get lost in theory. My boss Radhika Kulkarni, the VP of Advanced Analytics R&D at SAS, recognized this pattern in herself when she arrived at Cornell to pursue a PhD in math. That realization prompted her to switch to operations research, which she felt would let her investigate practical solutions to problems, something she preferred to more theoretical research.

That passion continues today, as you can hear Radhika describe in this video on moving the world with advanced analytics. She says “We are not creating algorithms in an ivory tower and throwing it over the fence and expecting that somebody will use it someday. We actually want to build these methods, these new procedures and functionality to solve our customers’ problems.” This kind of practicality is another key trait to evaluate in your job candidates, in order to avoid the pitfall of hires who are obsessed with finding the “perfect” solution. Often, as Voltaire observed, “Perfect is the enemy of good.” Many leaders of analytical teams struggle with data scientists who haven’t yet learned this lesson. Beating a good model to death for that last bit of lift leads to diminishing returns, something few organizations can afford in an ever-more competitive environment. As an executive customer recently commented during the SAS Analytics Customer Advisory Board meeting, there is an “ongoing imperative to speed up that leads to a bias toward action over analysis. 80% is good enough.”

9. Think about what kind of data scientist you need

Ken Sanford, PhD in Economics, University of Kentucky, speaking about how economists make great data scientists at the 2014 National Association of Business Economists Annual Meeting. (Photo courtesy of NABE)

Ken Sanford describes himself as a talking geek, because he likes public speaking. And he's good at it. But not all data scientists share his passion and talent for communication. This preference may or may not matter, depending on the requirements of the role. As this Harvard Business Review blog post points out, the output of some data scientists will be to other data scientists or to machines. If that is the case, you may not care if the data scientist you hire can speak well or explain technical concepts to business people. In a large organization or one with a deep specialization, you may just need a machine learning geek and not a talking one! But many organizations don’t have that luxury. They need their data scientists to be able to communicate their results to broader audiences. If this latter scenario sounds like your world, then look for someone with at least the interest and aptitude, if not yet fully developed, to explain technical concepts to non-technical audiences. Training and experience can work wonders to polish the skills of someone with the raw talent to communicate, but don’t assume that all your hires must have this skill.

10. Don’t expect your unicorns to grow their horns overnight

Annelies Tjetjep, M.Sc., Mathematical Statistics and Probability from the University of Sydney, eating frozen yogurt.

Annie Tjetjep relates development for data scientists to frozen yogurt, an analogy that illustrates how she shines as a quirky and creative thinker, in addition to working as an analytical consultant for SAS Australia. She regularly encounters customers looking for data scientists who have only chosen the title, without additional definition. She explains: “…potential employers who abide by the standard definitions of what a ‘data scientist’ is (basically equality on all dimensions) usually go into extended recruitment periods and almost always end up somewhat disappointed - whether immediately because they have to compromise on their vision or later on because they find the recruit to not be a good team player….We always talk in dimensions and checklists but has anyone thought of it as a cycle? Everyone enters the cycle at one dimension that they're innately strongest or trained for and further develop skills of the other dimensions as they progress through the cycle - like frozen yoghurt swirling and building in a cup.... Maybe this story sounds familiar... An educated statistician who picks up the programming then creativity (which I call confidence), which improves modelling, then business that then improves modelling and creativity, then communication that then improves modelling, creativity, business and programming, but then chooses to focus on communication, business, programming and/or modelling - none of which can be done credibly in Analytics without having the other dimensions. The strengths in the dimensions were never equally strong at any given time except when they knew nothing or a bit of everything - neither option being very effective - who would want one layer of froyo? People evolve unequally and it takes time to develop all skills and even once you develop them you may choose not to actively retain all of them.”

So perhaps you hire someone with their first layer of froyo in place and expect them to add layers over time. In other words, don't expect your data scientists to grow their unicorn horns overnight. You can build a great team if they have time to develop as Annie describes, but it is all about having appropriate expectations from the beginning.

To learn more, check out this series from SAS on data scientists, where you can read Patrick Hall's post on the importance of keeping the science in data science, interviews with data scientists, and more.

And if you want to check out what a talking geek sounds like, Ken will be speaking at a National Association of Business Economists event next week in Boston – Big Data Analytics at Work: New Tools for Corporate and Industry Economics. He'll share the stage with another talking geek, Patrick Hall, a SAS unicorn I wrote about in my first post.

Is machine learning trending with economists?

I am noticing a trend. At the ASSA meetings in January (where economics, sociology, and finance academics and practitioners gather to discuss their research) I was surprised to see how much “machine learning” was trending with economists. The session “Machine Learning Methods in Economics and Econometrics,” with papers by Susan Athey (Microsoft and Stanford) and Pat Bajari (Amazon and University of Washington), was one of the most popular at the conference. Both authors are joining a group of economists who are cautiously dipping their toes into the area of predictive modeling known as machine learning (ML). While ML tools have become increasingly popular with computer scientists, only recently have economists embraced the value of some of these methods.

The second piece of evidence suggesting that machine learning is trending with economists came from a conference I recently attended at the National Academy of Sciences called “Drawing Causal Inference from Big Data.” The more than 400 attendees from academia, government, and industry heard papers from top academics working to merge the field of statistical inference and causality with the tools typically used with “big data.” Here are my highlights from the conference, with recordings of the talks included as links.

  • Michael Jordan talked about the intersection of statistical computing and inference and reviewed the literature on inference under constraints. My favorite part was the concept of the “Bag of Little Bootstraps,” a method for assessing the quality of estimators computed from a large dataset.
  • Judea Pearl talked about the importance of the causality story with “big data.” For example, “subjects of the big data system (patients for instance) will attempt to pull causality from the users of big data (doctors)”…. So correlation won’t be enough for long. He spent a lot of time on the problems of observational data and causality (with a lot of reliance on DAGs).
  • Thomas Richardson spoke about using observational pharmaceutical data to inform efficacy. David Heckerman from Microsoft talked about personalized medicine based on genetic information and spent some time explaining work he calls FaST-LMM.
  • Bernhard Schölkopf explained how causal models can help machine learning and gave an overview of his research in this area.
  • Susan Athey’s talk was about using trees to improve causal inference. This talk was my favorite, because it surveyed many different ML methods that can assist an economist in model specification, along with plenty on cross-validation and heterogeneous treatment effects.

The third piece of evidence for this trend is a shameless plug for an upcoming talk I am giving. At an event hosted by NABE, Big Data Analytics at Work: New Tools for Corporate and Industry Economics, Patrick Hall, Chief Machine Learning Scientist at SAS, and I will give economists an introduction to the methods and technology needed to get started with ML methods. We plan to discuss many of the methods that can be used to glean insight from large data sets, whether they are long or wide. There are very few seats left for the conference, which is June 16-17, so if you haven’t already done so, sign up now! It will be an excellent introduction to both methods and potential applications. I will make sure to post our slides after the conference. Let’s see if the trend toward machine learning in economics continues!

Numeric validation for analytical software testing

There is a job category unfamiliar to most people that plays a crucial role in the creation of analytics software. Most can surmise that SAS hires software developers with backgrounds in statistics, econometrics, forecasting or operations research to create our analytical software; however, most do not realize there is another group of people who work closely with individual developers to test their code. For the analytics products at SAS these people are called analytics testers. What do they do?

At SAS, verifying the correctness of procedural output is termed “numeric validation.” This process consists of independently checking and verifying all the numeric output created by a developer in a SAS procedure or function. Just as SAS has invested in a large stable of talented developers with advanced degrees in specialized areas of statistics, econometrics, forecasting, operations research, and mathematics, SAS has also invested in an equivalent stable of analytics testers with advanced degrees in the same specialty areas. One primary responsibility of an analytics tester is to ensure numeric correctness independently of the developer, which they typically do by replicating the method in alternate code. Think of dueling PhDs racing to implement the same algorithm, but in different ways. Numeric validation requires that their results agree, because agreement provides greater assurance that the implementation and output of the algorithm are correct.
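
To make that concrete, here is a minimal sketch of what such a check might look like, assuming (hypothetically) that the developer’s procedure and the tester’s independent SAS/IML program each write their estimates to a data set; PROC COMPARE then flags any values that differ by more than a tolerance.

```sas
/* Hypothetical numeric-validation check: compare the developer's output data
   set against the tester's independently computed results, and keep only the
   observations where the two disagree beyond a small absolute tolerance.     */
proc compare base=work.dev_results        /* output from the developer's procedure */
             compare=work.tester_results  /* output from the tester's IML program  */
             out=work.disagreements outnoequal
             method=absolute criterion=1e-8;
   var estimate stderr;                   /* hypothetical numeric columns to check */
run;
```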

Developer Ying So on the left and analytical tester Yu Liang on the right, who "dueled" over the survival analysis work described in this post

But what happens when they do not agree? This happens quite often during the software development process. As you might imagine, it leads to a lot of head-scratching and white-boarding to figure out the disparity. When the numbers differ, both the developer’s and the tester’s results are called into question. Who is correct? Did the developer implement the statistical algorithm in the C language the same way the analytics tester did in SAS/IML, or vice versa? Was there a different interpretation of the algorithm from the source material, such as a journal article? Resolving the discrepancy can take time, because many of these algorithms and mathematical approaches are subtle, open to interpretation, and require years of training to understand. Codifying those subtleties and abstractions into closed-form, robust, and well-tested C code that produces the correct values for SAS customers can be quite arduous, and so is numeric validation of the results through an independent pathway.

Let me give you a concrete example. One of the analytics testers for SAS/STAT, who is responsible for testing and validating a heavily used SAS procedure for drug trials, identified an issue with the cumulative incidence function (CIF) that was part of a new feature under development. Her numbers did not match the output from the procedure the developer created. She notified the developer, and thus began a three-week exercise analyzing why their numbers differed. The tester had to write a 900-line SAS/IML program to independently calculate the CIF, because there were no other independent means to validate it. After much discussion, the developer determined that the analytics tester’s approach was technically correct and adjusted his C code for the procedure accordingly.

On the surface one might think this is just two statisticians arguing over a seemingly arcane issue, but the computation is critical in a field of statistics referred to as survival analysis. Biostatisticians and medical researchers use survival analysis to determine which factors increase the probability of survival for subjects in medical studies. Life-altering decisions are made based on the results of this analysis, so it is not hyperbole to say that numeric validation can be a life and death matter.

SAS analytic software is used the world over to drive decisions in nearly every area of scientific research, business, and government policy setting. Ensuring SAS software is well-tested and numerically correct is key to the integrity of those decisions, and analytics testers are a core part of that process. A few years ago SAS CEO Jim Goodnight said, “SAS was, is and always will be a collection of people who put the needs of the customer first, produce quality software unmatched by any other, and thrive in an innovative workplace that serves as a model across the globe.” This week at SAS is the second annual Quality Week, which is dedicated to raising awareness of the importance of software quality and empowering employees to help solve quality challenges. As SAS employees focus on quality this week, I am proud to lead a team whose particular contribution to quality is numeric validation.

Easter forecasting challenges the Easter Bunny brings

The date of Easter influences our leisure activities

Unlike most other public holidays, Easter is a so-called movable holiday. This means that the Easter bunny brings more than just eggs for the statistician – he brings special Easter forecasting challenges. In the year 325 CE the Council of Nicaea determined that Easter would fall on the Sunday after the first full moon of spring. The earliest possible date for Easter is thus March 22nd, the latest April 25th. Carl Friedrich Gauss took this rule and developed his Easter algorithm, which determines the date of Easter for any given year.
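
For the curious, the computation can be written down in a few lines. The sketch below uses the Meeus/Jones/Butcher variant of the Gregorian Easter computation (a descendant of Gauss’s rule) in a simple DATA step; it reproduces, for example, March 27, 2016 and April 16, 2017.

```sas
/* Compute the date of (Western) Easter for a range of years using the
   Meeus/Jones/Butcher algorithm, a closed-form variant of the Gauss rule. */
data easter_dates;
   do year = 2015 to 2020;
      a = mod(year, 19);
      b = int(year / 100);   c = mod(year, 100);
      d = int(b / 4);        e = mod(b, 4);
      f = int((b + 8) / 25); g = int((b - f + 1) / 3);
      h = mod(19*a + b - d - g + 15, 30);
      i = int(c / 4);        k = mod(c, 4);
      l = mod(32 + 2*e + 2*i - h - k, 7);
      m = int((a + 11*h + 22*l) / 451);
      month = int((h + l - 7*m + 114) / 31);
      day   = mod(h + l - 7*m + 114, 31) + 1;
      easter = mdy(month, day, year);
      output;
   end;
   format easter date9.;
   keep year easter;
run;
```
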
The moving date of Easter matters, because this holiday has a very strong influence on our leisure activities. Many people use the feast of Easter to take a spring holiday, which in turn has a strong impact on the hotel, restaurant, and transportation industries. Our choice of activities also depends on the timing of Easter. For many of my fellow Europeans, when Easter falls at the end of March, we may plan a ski vacation, while an April date may call for plans that involve enjoying the warming spring weather instead of seeking the snow.

Analysis of the data mirrors that behavior

Monthly time series data reflect this behavior. Over a multi-year observation window you often see the peak switch between March and April.

  • When Easter falls in March, winter sports resorts in the Austrian Alps often have full hotels in that month. In April, however, occupancy drops very sharply, because Easter often also marks the end of the season for the ski lifts.
  • The importance of the Easter holidays can also be observed in the number of air and rail travelers and of visitors to recreational facilities.

The following graph shows the number of visitors at a leisure park from 1993 to 2000. To make the pattern easier to recognize, only the months of March (blue) and April (red) are shown. In addition, a green step curve shows whether Easter falls in March or April in each year.

[Figure: leisure park visitors, 1993-2000, March (blue) vs. April (red), with the timing of Easter shown as a green step curve]
It is easy to see that in years when Easter falls in March, the blue line is higher than the red line; when Easter falls in April, the reverse is true. The behavior of consumers and tourists shows up clearly in our analysis of the data.

It is important to keep this fact in mind when evaluating results. From a sales perspective it does not make sense to panic when March sales fall behind those of the previous year if the "Easter peak" is not expected until April.

Easter forecasting challenges

A peak that varies between March and April can be a problem for time series forecasting models. Many simple time series models use only the historical values of the series. If Easter falls in April in two consecutive years, such a model expects the next peak in April again, because it has only learned the seasonal effect – and that is where these Easter forecasting challenges come from.

This fact matters because there is an increasing trend for planning software packages to offer only a few simple standard methods of time series forecasting. For some applications a simple seasonal exponential smoothing model does quite well, but it takes a little more effort to handle the situations described above. You do not want to manually correct your budget figures for the “Easter effect” every time, and it is tedious to always have to explain your own numbers in a planning meeting with the caution, "Yes, but this year...."
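
One simple way to give a model a fighting chance – shown here only as a minimal sketch, not as the approach SAS Forecast Server takes internally – is to derive an Easter indicator from the calendar and feed it to the model as an input variable. The data set and variable names below are hypothetical.

```sas
/* Build a monthly Easter dummy with the HOLIDAY function and use it as an
   input series in PROC ARIMA, so the model can shift the peak between
   March and April depending on where Easter falls that year.              */
data visitors_x;
   set work.visitors;                 /* hypothetical monthly series: date, visits */
   easter_month = (month(date) = month(holiday('EASTER', year(date))));
run;

proc arima data=visitors_x;
   identify var=visits crosscorr=(easter_month);
   estimate p=1 q=1 input=(easter_month);   /* illustrative model orders */
   forecast lead=12 out=work.fcst;
run;
```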

Events, inputs and flexible time intervals make a difference

The forecasting solution from SAS is built to consider these points for you:

  • Automatic model selection checks for you whether the time series can be forecast with a simple model or whether a more powerful model should be used.
  • Your individual data are analyzed to select the most appropriate model from a model library and fit it to your data.
  • The forecast considers input variables like the number of workdays, Saturdays, or Sundays, as well as already-booked orders.
  • Pre-defined calendar events like Easter or Christmas can be considered in the analysis.
  • You can define individual events that are important for your company or organization that influence the course of your key performance indicators.
  • You can adapt the length of the time intervals to the seasonal pattern of your processes.

Next year, Easter takes place on March 27th; in 2017 it will fall on April 16th. Check now to see whether your forecasting software and the resulting forecast numbers consider this shift.

SAS Global Forum, another moveable date

Another event that has a moving date is SAS Global Forum. This year it takes place April 26th-29th in Dallas, Texas.
My paper, “Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows You How,” is scheduled for Wednesday, April 29th, 10:00-10:50 a.m. (Preview the A2014 Demo Version). I look forward to meeting you there!

The unicorn quest continues – more tips on finding data scientists (part 3)

Because finding analytical talent continues to be a challenge for most, here I offer tips 5, 6, and 7 of my ten tips for finding data scientists, based on best practices at SAS and illustrated with some of our own “unicorns.” You can read my first blog post for why they are called unicorns and for tips 1 and 2 on finding them in an MS in Analytics Program or from a great program you’ve never heard of. You can read tips 3 and 4 on how to find this kind of talent outside the traditional academic disciplines. Today’s post focuses on interns, social networks, and sponsorship for permanent residency.

5. Try before you buy – create an intern program

Golbarg Tutunchi (ABD* in Industrial Engineering/Operations Research, NC State University), Fatemeh Sayyady (PhD in Operations Research, NC State University), Zohreh Asgharzadeh (PhD in Operations Research, NC State University), and Shahrzad Azizzadeh (ABD* in Operations Research, NC State University).

Intern programs are a great way for both the employer and the student to find out if the candidate is a good fit for the organization, so SAS hires close to 200 students each year. In addition to the standard internship programs that we run, we have two special programs that are particularly useful in our quest to find analytical talent. The Graduate Research Assistant (GRA) program places PhD students from local universities into Advanced Analytics R&D for up to 20 hours/week during the year and full-time during the summer (we had 14 in 2014). These students work on research at SAS related to their graduate training, and we maintain a partnership with their academic advisors. SAS provides funding to the academic departments, which in turn provide a stipend, tuition, and benefits for the students. Academics benefit from exposure to problems in practice, SAS benefits from academic insight into our research areas, and the students receive not only funding but also exposure to practice. R&D also has technical student positions that function like regular internships. R&D hires most of these students when they finish, including the women above, who are part of the growing Iranian diaspora at SAS – Fatemeh (who works on SAS® Marketing Optimization), Zohreh (who leads a team of operations research specialists who develop solutions for retail and manufacturing), Shahrzad (who tests our mixed integer programming solver and decomposition algorithm in SAS/OR®), and Golbarg (a current GRA student who works on our Advanced Analytics and Optimization Services team, which provides specialized consulting on optimization and simulation problems).

Advanced Analytics R&D also has an Analytical Summer Fellows program, which hires PhD students for a special summer internship to expose them to the world of commercial statistical software research and development and provide an opportunity to explore software development as a career choice. In addition to a standard summer internship salary, the students receive a stipend, which allows us to recruit students from around the US. The program began with one student in statistics in 2006, and thanks to our CEO Jim Goodnight’s support and enthusiasm for programs of this kind, we have now expanded it to include positions for students in statistics, data mining, econometrics, operations research, and text analytics.

6. Use social networks to hire the friends of your unicorns

Left circle, Zohreh Asgharzadeh, and right circle, Shahrzad Azizzadeh, at Sangan Waterfall, northwest of Tehran, with classmates from their Sharif University of Technology Industrial Engineering class.

We know the power of social networks from an analytical perspective, but have you thought about using them in your recruiting? At SAS, 55% of our hires come from an employee referral. After all, while merit is critical in choosing the best candidate for a given role, having a “spotter” help you fill the pipeline can help immensely, since the pool of unicorns is small in the first place. And who better to know where to hunt for unicorns than a member of their own species? Our employees are eager to recruit people they know, because they consistently tell us they want to work with smart people.

Remember Zohreh and Shahrzad, two of our former GRA students? They have been close friends since high school in Iran. Zohreh came to the US first, to attend graduate school at NC State University and pursue her PhD in Operations Research. Shahrzad was still in Iran, first pursuing an MS in Industrial Engineering and then working in industry. Hearing about her friend’s positive experiences in graduate school in the US, Shahrzad decided to apply to a US graduate school herself. Zohreh worked as a GRA student at SAS on retail optimization solutions, and because she enjoyed her work she opted to join SAS full-time upon graduation. Later, when Shahrzad saw an advertisement for a GRA placement at SAS, she asked her friend Zohreh about it. Zohreh encouraged her to apply, and the rest is history. And Fatemeh (from the photo above) applied for a position in part because she had a positive impression of SAS conveyed by her friends Shahrzad and Zohreh.

These two long-time friends have quite the academic pedigree: their math tutor in high school was Maryam Mirzakhani, who last year won the Fields Medal, widely considered the Nobel Prize of math. As the first woman to win in the medal’s 78-year history, she is a trailblazer. These women, who were encouraged to pursue their interest in math, challenge the pattern of the Western world, where far fewer women choose the STEM disciplines. Curious about that difference, I found an interesting article that provides a cross-cultural analysis of students with exceptional math talent and offers some recommendations. Its conclusion is sobering: "In summary, some Eastern European and Asian countries frequently produce girls with profound ability in mathematical problem solving; most other countries, including the USA, do not." So while efforts are being made to close that gap, perhaps an additional tip should be that if you want to hire more female unicorns, seek out the Iranians!

7. Be willing to invest in sponsorship for foreign nationals

Bahadir Aral (PhD in Industrial and Systems Engineering from Texas A&M University), in front of the Süleymaniye Mosque in Istanbul.

"In today's global economy, innovation is the key to sustained growth and success. At SAS, we have long recognized this fact; it is why we are so committed to attracting and retaining the best and brightest minds from across the globe," wrote Jim Goodnight in this opinion piece on immigration reform. Part of his interest in immigration derives from the fact that economic forecasts project that American universities will produce 1 million fewer graduates from the science, technology, engineering and math disciplines (STEM) than the US workforce needs. And he points out that American universities are not currently able to meet that need with US-born students. In the US we are fortunate that top students from around the world are drawn to our excellent higher education system, so American companies can recruit top global talent domestically. An obvious repercussion is that some of these graduates will need legal sponsorship for the papers to stay in your country. As our CEO wrote, at SAS we feel this can be an investment with high return - the fees are not significant and the talent available is high.

For example, Bahadir is a pricing expert and consultant in our Advanced Analytics and Optimization Services group in R&D. Even though he came to SAS with almost five years of professional experience, he still needed to complete paperwork for a green card to be able to work in the United States. When his work visa ran out he had to return to his native Turkey and work from our Istanbul office for 6.5 months while the next step in his immigration application was being processed. Bahadir is so dedicated that he maintained his customer meetings in spite of the seven hour time zone difference, which meant he was sometimes scheduled for telephone calls that approached the midnight hour in Turkey! Another pricing expert on his team, Natalia, is Russian by birth but as a child moved to Mexico, where she obtained her first PhD in industrial engineering at Tecnológico de Monterrey. She then moved to the US to pursue a second PhD, this time in Operations Research, from North Carolina State University. She arrived at SAS with dual Russian and Mexican citizenship but didn’t yet have permanent residency, so SAS is also sponsoring her application.

Natalia Viktorovna Summerville in front of St. Basil’s Cathedral on Red Square in Moscow.

Last three tips next time:

8.  Hire the curious who want to solve problems
9.  Think about what kind of data scientist you need
10.  Don’t expect unicorns to grow horns overnight

What makes the smart grid “smart”? Analytics, of course!

You’ve heard about the smart grid, but what is it that makes the grid smart? I’ve been working on a project with Duke Energy and NC State University doing time-series analysis on data from Phasor Measurement Units (PMUs) that illustrates the intelligence in the grid as well as an interesting application of analytical techniques. I presented some of our findings at the North American SynchroPhasor Initiative (NASPI) workgroup meeting in Houston recently, so I thought I’d share them with a broader audience.

PMUs in the power grid

Phasor Measurement Units (PMUs) take measurements of the power transmission grid at a much higher speed and fidelity than previous systems provided. PMUs measure the power frequency (nominally 60 Hz), voltage, current, and phasor angle (i.e., where you are on the power sine wave). These units take readings at a rate of 30 measurements per second, while the previous systems took readings only every 3-4 seconds. This more frequent interval provides a much more detailed view of the power grid and allows detection of sub-second changes that were completely missed before.

Another great feature of the PMUs is their very accurate time measurement. PMUs are installed at points along the power grid miles apart from each other. For example, Duke Energy has over 100 PMUs installed across the Carolinas. To analyze the data and learn about the whole grid, we need to synchronize the measurements taken at these locations. PMUs have Global Positioning System (GPS) receivers built in, not to determine their location, but so that they all receive the same accurate time signal. Since GPS provides time accuracy in the nanosecond range, this is more than sufficient for measurements taken 30 times per second. This accuracy is most critical in the measurement of phasor angles: by comparing the phasor angles between locations, we get a measure of the power flow between them. Since the measurements are of something changing at a frequency of 60 Hz, the time stamp of the measurement must be of significantly higher precision than what is being measured.

Working with this data has highlighted the similarities and differences between working with big data and high-speed streaming data. Big data is typically a large amount of data that has been captured and stored for analysis. Streaming data is constantly arriving at a high rate of speed and must be analyzed as it is received. One of the many interesting things about this project is that it involves both big data and streaming data.

So what have we learned working with this data? The main purpose of this project is to detect and understand events that affect the power grid, with the objective of keeping the grid stable. We have learned that a number of time-series techniques are needed for the different aspects of providing those answers. The analysis flow breaks down into three areas: event detection (did something happen?), event identification (what happened?), and event quantification (how bad was it?).

For event detection, the task at hand is streaming data analysis. The system generates 30 measurements per second on hundreds of sensors and tags. Fortunately, the vast majority of the time (>99.99%) they indicate that no event of any kind is occurring. Since time-series patterns are present, they can be modeled and used to detect a deviation from the normal pattern. Determining these models allows us to look forward with a very short-term forecast and then instantly detect an event of interest.
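
As a toy illustration of the detection idea – not the project’s actual models – the sketch below tracks a slowly adapting baseline for the frequency signal and flags samples that stray too far from it. The data set, variable names, and threshold are hypothetical.

```sas
/* Flag candidate events in a (previously captured) stream of PMU frequency
   readings: maintain an exponentially weighted baseline and mark samples
   whose deviation from that baseline exceeds a threshold.                  */
data work.flagged;
   set work.pmu_stream;                         /* hypothetical: timestamp, frequency */
   retain baseline;
   if _n_ = 1 then baseline = frequency;
   deviation  = abs(frequency - baseline);
   event_flag = (deviation > 0.05);             /* illustrative threshold, in Hz      */
   baseline   = 0.98*baseline + 0.02*frequency; /* slow-moving expected value         */
run;
```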

Event identification is the next order of business. An event of interest doesn’t necessarily mean there is a problem or that one will develop. Some events are random, like a lightning strike or a tree hitting a power line. Others represent some type of equipment failure. We’ve found that many of these events produce a similar ‘signature’ in the data stream: time-series similarity analysis and time-series clustering have been able to match incoming events to previously seen events. Knowing which previous event signatures are non-consequential allows us to safely ignore the new events that match them.
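
A rough sketch of how such a signature match might be scored in SAS – again only illustrative, with hypothetical data set and series names, and not the project’s actual implementation – uses PROC SIMILARITY to compare a newly detected event window against a few reference signatures.

```sas
/* Compare the newly detected event window against stored reference
   signatures and write a summary of similarity measures for each pairing. */
proc similarity data=work.event_windows outsum=work.match_scores;
   input new_event;                                        /* the new signature      */
   target lightning_sig breaker_trip_sig / measure=sqrdev; /* known event signatures */
run;
```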

Finally we look at event quantification. For some events, the question is not just whether the event is occurring but also whether its magnitude gives cause for concern. An example is oscillation on the power grid. Small, diminishing oscillations are not necessarily a problem, but larger ones that are growing may require further attention. Once the event type is identified, specialized techniques for that type determine its magnitude and consequence.

This project has provided interesting insights into how to make the power grid smarter. Many of these techniques are also beneficial to streaming data analysis seen in other industries and applications. If there is a need to automatically identify and categorize system events based on data patterns, or filter out events that are non-consequential, then these techniques will be helpful.

Photo credits

PMUs in the power grid: Syncrophasor Technologies and their Deployment in the Recovery Act Smart Grid Programs, August 2013 report by the US Department of Energy

Firefighter image credit: photo by US Navy // attribution by creative commons

Hoover Dam image credit: photo by IAmSanjeevan // attribution by creative commons

Are there jobs for economists in analytics?

SAS will again be participating in the Allied Social Science Association annual meetings in January. This year the event will be held in Boston, and conference organizers expect more than 12,000 participants from a variety of backgrounds, including economics, finance and many other social sciences. One of the primary functions of the event, aside from traditional academic sessions, is that it serves as a single meeting place for employers and job candidates. Each year, approximately 1,000 candidates attend ASSA for the sole purpose of finding a job. As I’ve written before, corporate economists are hot again and a great source for analytical talent, so if you’re on the job market consider exploring a career in industry. It’s a great place for economists to look for a job.

I’ve worked in academia and industry, so I know both worlds. And in my current role I constantly talk with economists performing analytical functions at some of the world’s largest companies. What I have seen is that both worlds provide ample opportunity to use your skills as an economist. In fact, I would argue that a corporate analytics role will challenge your data skills in ways that academia does not. You will be pushed to learn new methods, and as an economist you will be uniquely positioned to explain the results to colleagues. There is, of course, no ‘free lunch,’ so your research will be guided by the firm’s revenue maximization or cost minimization priorities.

As you may know, SAS is one of the analytical computing platforms preferred by both business and government. In fact, Monster.com ranked it #1 on their list of “Job Skills that Lead to Bigger Paychecks.” For these reasons I was curious to do a little research on jobs that ask for SAS. I chose to search the JOE website (Job Openings for Economists), the primary listing source for jobs in economics. I limited my search to positions listed as “Full-time nonacademic,” as universities are not likely to prefer one computing language over another. Of the 247 nonacademic listings, here is what I found for each program.

Search term    Number of jobs
SAS            41
Stata          34
Matlab         27
R              16
Python         11
SPSS           5
SQL*           9

*While SQL is not a computational program, I would consider it a language.

SAS was explicitly listed in 41 job descriptions. Each number above is a hyperlink to the actual search, should you be interested in seeing the jobs available.

For those attending the ASSA conference, please stop by the SAS booth in the exhibit hall and say hello. And if you want to talk about the market for economists in industry or would like to meet some of this year’s job market candidates or industry representatives, please consider attending the SAS Academic and Industry Reception to be held on Sunday, January 4th at 5pm. If you are not attending ASSA but will be in the area on that date, we’d be glad to have you join us at this reception. Please click here to RSVP.

Don’t be fooled: there are really only two basic types of distributed processing

Every time I pick up a new article about analytics, I am always disappointed by the fact that I cannot find any specifics mentioned about back-end processing. It is no secret that every vendor wishes they had the latest and greatest parallel processing capabilities, but the truth is that many software vendors are still bound by single-threaded processing – as indicated by their obvious reticence about discussing details on the subject. As a result of using older approaches to data processing, most competitors will toss around terms like ‘in-memory’ and ‘distributed processing’ to sow confusion about how their stuff actually works. I will explain the difference in this post and tell you why you should care.

The truth is that there are really only two basic types of distributed processing, namely multi-processing (essentially grid-enabled networks) and pooled-memory massively parallel processing (MPP). Multi-processing essentially consists of duplicate sets of instructions being sent to an array of interconnected processing nodes. In this configuration, each node has its own allocation of CPU, RAM, and data, and generally does not have the ability to communicate or share information with other nodes in the same array. While a large multi-step job can be chopped up into pieces and each piece processed in parallel, the multi-processing configuration is still largely limited by the duplicate, single-threaded sets of instructions that need to run.

Contrast multi-processing with a pooled-memory architecture that has inter-node communication and does not require duplicate sets of instructions. Each node in a pooled-resource configuration can work on a different part of a problem, large or small. If any node needs more resources, data, or information from any of the other nodes, it can get what it needs by issuing messages to any of the other nodes. This makes for a truly ‘shared resources environment,’ and as a consequence it runs about ten times faster than the fastest multi-processing array configuration.

Now much of the confusion about these two types of distributed processing exists because of misuse of the term ‘in-memory’. The fact is that ALL data processing occurs in-memory at some point in the execution of a set of code instructions, so ‘in-memory’ by itself is really a misnomer for distributed processing. For example, traditional SAS processing has always occurred in-memory as blocks of data are read from disk into RAM. As RAM allocations have gotten larger, more data has been loaded into memory, yet the instructions were still processed using a single-threaded, sequential approach. What was needed was a rewrite of the software to enable multi-threading, namely routing separate tasks to different processors. Combining a multi-threaded program with all data pre-loaded into memory produces phenomenally fast run times compared with what could be accomplished before.

Even though a program is multi-threaded, there is still no guarantee that things will run faster. An obvious example is Mahout, an Apache project that relied on MapReduce to facilitate inter-node communication in a pooled-resource environment. MapReduce is notoriously slow, as nodes take a long time to load data into memory and must write inter-node communication requests to disk before other nodes can access the request. As a consequence of its lethargic response time, Mahout has largely been abandoned by most large business customers in favor of faster architectures.

Message Passing Interface (MPI) is a much faster communication protocol, because it runs in-memory and can handle the multiple passes over the data that are common in predictive analytics work. Currently there are only two MPI initiatives that offer true multi-threading: one based on Spark, an in-memory plug-in to Hadoop, and SAS High-Performance Analytics. Spark development is still in its infancy, comparatively speaking, and it will likely be years before any push-button applications can make use of its capabilities. Alternatively, SAS has products that are production-ready today and can dramatically shorten your analytics lifecycle. So do not be fooled by vague claims of in-memory or distributed processing; MPI-enabled pooled-memory processing is here to stay and is well positioned to become the de facto standard for future predictive analytics processing.
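
From the user’s side, invoking this kind of pooled-memory processing in SAS is mostly a matter of pointing a high-performance procedure at a configured environment. The sketch below is only illustrative – the data set, variables, and resource settings are hypothetical, and the connection details for a real distributed deployment are site-specific.

```sas
/* Fit a logistic regression with a SAS high-performance procedure. With no
   distributed environment configured, it runs multi-threaded on the local
   machine; pointed at a configured grid, the same step runs across the
   pooled-memory nodes.                                                     */
proc hplogistic data=work.transactions;
   class channel;
   model fraud_flag(event='1') = amount tenure channel;
   performance nodes=8 nthreads=16 details;   /* illustrative resource settings */
run;
```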

For many routine analytics jobs, your existing architecture may be sufficient. But these phenomenally fast run times matter when you are trying to process dozens, if not hundreds, of model tournaments built from the most advanced machine learning techniques, like random forests and deep-learning neural networks. Statistical professionals are finding that these new techniques are not only more accurate, but also allow us to investigate much lower levels of granularity than ever before. As a result, models are getting more precise and profitability is increasing concomitantly. So if you want to solve more problems faster and with more accuracy (with the same headcount), be sure to scrutinize claims of “in-memory” and distributed processing and choose the right architecture for your job.