Dear Ms. Value! I am missing you! - or the importance of missing values in analytics

Don’t worry! This is not an excerpt from a romantic love letter. The title of this blog post is an allusion to my talk on "Missing Values", at the A2013 conference in June in London.

There is not much time for emotions: dealing with missing values in analysis is not a romantic game but an inevitable reality for many statisticians and data miners. Thus, there is a reason to give this issue appropriate attention. In this post I want to share exclusively with you some of my thoughts about missing values and provide a taste of my A2013 presentation.

It is important to be clear: Missing values and missing values are not always the same (this must be the first time in statistics that there is no general recipe for something). A close look at the properties of missing values is absolutely necessary to have the right recipes on hand to detect and treat them.

How do I know that something is missing?

This question may sound trivial. A missing value in a table can be recognized by the fact that a cell is not filled in but is empty! In fact, there are many examples of such explicit missing values. A field is left blank because the date of birth is unknown. Or a customer refuses to answer and no value is entered in the field "number of children." Such cases can be easily detected and selected by database queries.

But not all missing values reveal themselves in such a simple way. The analysis of the number of machine downtimes in a factory results in the following: three events in April, four in May, two in July. From a purely technical point of view no values are missing and the "number of failures" is always filled. However, from a content point of view we do have missing data: there is no entry for June. It now has to be decided whether this means that there were no failures in June or that no information exists for June.

We must check the analysis to see where data points are missing (where the time series has “holes”) and how these holes should be interpreted from a business point of view. In the example above with monthly data on failures we discover the gap because we know the calendar and know that June follows May. However, if we want to analyze which and how many products an insurance customer has, and the data from one contract system (e.g. health insurance) is not provided, the situation is different. We then have missing information that cannot be made visible in a simple way. In this case, a data review with a business expert is very important.
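The calendar check described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout (month number mapped to failure count) and the function name find_gaps are my own assumptions, and the code can only flag the hole, not decide what it means.

```python
# Hypothetical downtime counts keyed by month number, taken from the example:
# three events in April, four in May, two in July.
failures_by_month = {4: 3, 5: 4, 7: 2}


def find_gaps(counts):
    """Return the months between the first and last observed month that have no entry."""
    first, last = min(counts), max(counts)
    return [month for month in range(first, last + 1) if month not in counts]


gaps = find_gaps(failures_by_month)
print(gaps)  # [6]
```

Whether June should then be filled with zero failures or treated as unknown remains a business decision.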

Are simple frequencies really sufficient?

Missing values are often described by calculating the relative frequency of missing values for a certain feature in relation to the total number of observations. This can result in the following statistics: 12% of the income values are missing, 8% of the age values are missing, and 4% of the modification dates are missing.

Thus we can illustrate which characteristics are most commonly infected by the "missing value disease" in our data. However, what we don’t see from this analysis is how many records have no missing value for any of the characteristics. But the number of so-called "full records" is an important parameter for many statistical analyses. In the above example, this may be 88% of the rows, if the values of age and modification date are missing only where the income value is also missing. However, it can also be only 76% of the rows, if the missing values are spread across different records and never overlap.
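The two extremes in this example follow directly from the per-feature percentages. Here is a minimal Python sketch of that arithmetic; the function name complete_record_bounds is hypothetical, and the percentages are the ones quoted above.

```python
def complete_record_bounds(missing_pct):
    """Bounds on the percentage of records with no missing value at all.

    missing_pct holds the percentage of missing values per feature.
    Lower bound: the missing values never overlap, so they add up.
    Upper bound: all missing values sit in the records of the worst feature.
    """
    lowest = max(0, 100 - sum(missing_pct))
    highest = 100 - max(missing_pct)
    return lowest, highest


# income 12%, age 8%, modification date 4%
print(complete_record_bounds([12, 8, 4]))  # (76, 88)
```

The true share of full records lies somewhere between these bounds, which is why the per-feature frequencies alone are not enough.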

A common method that is used to detect the structure of the missing values is to analyze the "missing value patterns" in the form of tile charts. Here it can be seen at a glance that about 60% of the records don’t have a single missing value (pattern 000000000000), and that another 30% of the records have a missing value in only one feature (light blue). The little red cells show groups of records where five or more features are missing. This information is important for deciding whether missing values should be imputed by analytical methods.
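The pattern counts behind such a chart are easy to compute. The following Python sketch is only an illustration with made-up records and feature names; I assume that a 0 in the pattern means "value present" and a 1 means "value missing", and that missing values are stored as None.

```python
from collections import Counter


def missing_pattern(record, features):
    """Encode one record as a string with 1 for a missing feature and 0 otherwise."""
    return "".join("1" if record.get(name) is None else "0" for name in features)


# Hypothetical records, for illustration only.
features = ["income", "age", "modified"]
records = [
    {"income": 3200, "age": 41, "modified": "2013-01-15"},
    {"income": None, "age": 35, "modified": "2013-02-01"},
    {"income": None, "age": None, "modified": "2013-02-20"},
    {"income": 2800, "age": 52, "modified": "2013-03-05"},
]

pattern_counts = Counter(missing_pattern(r, features) for r in records)
for pattern, count in pattern_counts.most_common():
    print(pattern, f"{count / len(records):.0%}")
```

The share of the all-zero pattern is exactly the share of full records discussed above.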

Just a coincidence?

The purely quantitative analysis of missing values still does not give a complete picture. It is just as crucial to analyze and decide whether the values are missing at random or systematically.

Missing age data, which occurs randomly for 20% of customers, represents a loss of information and creates a degree of uncertainty in our data. However, although our view is fuzzier, we can assume that this occurs with the same frequency across all segments, regions and so on, so our view is not biased by the missing values.

In contrast, consider the case when missing values occur for those customers who have had a long customer relationship with our company. Perhaps, in earlier years, new customers were not asked for their date of birth. In this scenario we must be aware that "age unknown" not only correlates with the duration of the customer relationship but also with the age itself (which for older clients is more often missing) and potentially with other characteristics (e.g. region, product type, ...). This fact must be considered in the analysis and the interpretation of the results as well.
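A simple first check for systematic missingness is to compare the missing rate across segments. The Python sketch below uses invented customer records and hypothetical field names ("tenure", "age"); it is a descriptive check under those assumptions, not a formal test.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, value_key):
    """Return the share of missing values of value_key within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical customers, for illustration only.
customers = [
    {"tenure": "long", "age": None},
    {"tenure": "long", "age": None},
    {"tenure": "long", "age": 67},
    {"tenure": "long", "age": None},
    {"tenure": "short", "age": 29},
    {"tenure": "short", "age": 34},
    {"tenure": "short", "age": None},
    {"tenure": "short", "age": 41},
]

rates = missing_rate_by_group(customers, "tenure", "age")
print(rates)
```

Clearly different rates between the groups hint that the values are not missing at random and that a simple mean imputation would bias the result.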

For these cases, powerful analytics combined with expert understanding provide methods for the detection and treatment of systematic missing values that go much beyond the simple "I-insert-the-mean-for-missing-values" method and provide a better basis for analysis.

Interested?

If I have gotten your attention, I would be pleased to meet you at my upcoming presentations. On June 12th I will be at the SAS Data Mining User Day in Heidelberg and on June 19th and 20th at the A2013 conference in London. If that's too long to wait, you can "pre-read" my new book Data Quality for Analytics Using SAS.

 

Hungry dogs and decision-making under uncertainty

My wife rescued a dog a couple of years ago from a rural North Carolina rest stop. We named her "DOTi" in honor of the Department of Transportation. It took a while for us to get into the swing of being responsible owners; sometimes the first to leave for work would forget to leave a note indicating that the dog had been fed. Of course, in the age of cell phones, it's a single text to clarify whether the dog ate. But I'd like to do a thought experiment to compare two methods of handling the uncertainty, since it illustrates a strategic challenge that organizations face just as much as dog owners do.

My strategy, as a dyed-in-the-wool probabilist, was simple. If there was any uncertainty about whether DOTi had eaten, I would flip a fair coin. Heads? Feed her the usual ration of two scoops. Tails? Don't feed her. People who have never owned a dog might think "Why not just ask her if she was hungry?" People who have owned dogs will understand that the typical canine answer to that question has never been anything other than wide-eyed, tail-wagging enthusiasm, signaling an unequivocal need to feed. If DOTi had already eaten, she stood a 50% chance of eating again; and if not, she had a 50% chance of going hungry. Neither outcome is optimal, and decision theorists will recognize them as a "False Positive" and a "False Negative." However, I had worked out that if the probability of DOTi having already eaten was p, then the expected value of extra scoops of food was 2p-1. And, unless I messed up my algebra, the variance of that count of extra scoops was just [1+4p(1-p)]. It had the satisfying finality of a problem with a solution as simple as flipping a coin.

My wife, as usual, was both wiser and more attuned to statistical optimality than I was. Her strategy, which didn't even require a coin toss, was to feed the dog one scoop if there was any uncertainty at all. So her decisions were always wrong: the dog either ate one scoop too many or one scoop too few. I sat down to do a little bit of arithmetic and show her the error of her ways (you very likely can see where this is going). While my strategy fed the dog exactly the right amount 50% of the time, the other 50% of the time I was off by either +2 or -2 scoops. My wife never got the amount exactly right, but she was never wrong by more than one scoop in either direction. While I was busy optimizing on the number of correct decisions, she was devising a strategy that yielded the same expected value of extra scoops (2p-1), but the variance of the extra scoops associated with her strategy was only [4p(1-p)]. So we had the same expected error, but she had a smaller variance.
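For readers who want to check the algebra, here is a small Python sketch that enumerates the outcomes of both strategies and computes the mean and variance of the extra scoops exactly. The function names are my own, and I assume the dog needs two scoops if she has not eaten and none if she has.

```python
from fractions import Fraction


def moments(outcomes):
    """Mean and variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(value * prob for value, prob in outcomes)
    variance = sum(prob * (value - mean) ** 2 for value, prob in outcomes)
    return mean, variance


def coin_flip_strategy(p):
    """Extra scoops when a fair coin decides between two scoops and none."""
    half = Fraction(1, 2)
    return [
        (2, p * half),         # already ate, coin says feed: two scoops too many
        (0, p * half),         # already ate, coin says skip: correct
        (0, (1 - p) * half),   # not fed yet, coin says feed: correct
        (-2, (1 - p) * half),  # not fed yet, coin says skip: two scoops short
    ]


def one_scoop_strategy(p):
    """Extra scoops when one scoop is always given."""
    return [(1, p), (-1, 1 - p)]


p = Fraction(3, 10)  # assumed probability that the dog has already eaten
print(moments(coin_flip_strategy(p)))
print(moments(one_scoop_strategy(p)))
```

Both strategies return the same mean, 2p-1, while the variances differ by exactly 1 for every value of p.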

If you need to make decisions under uncertainty, then you need a way to objectively measure the effectiveness of those decisions. In this toy example, we have to pick between two scoops dependent on a coin toss or one scoop every time; even so, there are arguments to be made for either strategy. If the situation calls for getting the decision exactly right at least some of the time (like showing an ad to the right customer or auditing a fraudulent tax return), then a strategy should be evaluated based on that criterion. The profit matrix in SAS® Enterprise Miner™ gives you an easy way to weight those outcomes relative to their importance. However, if your goal is to get an accurate estimate of an amount (like the amount of food the dog needs to eat or the dollar amount associated with the ad or audit decision above), then a measure of the size of the error might be more appropriate. Either way, the objective of your decision making should drive the objective evaluation of competing decision making strategies.

Why "soft skills" are so important in analytics....and how to learn them

How many analytical projects have foundered due to lack of problem definition and other soft skills? As my SAS colleague Sascha Schubert writes, people and process matter, in addition to great technology. Great technology is a great first step, but having the right people following the right process is critically important. But how do you find the right people and what process should they follow? What skills should they have and how do you teach them?

Much as we lament the shortage of graduates from the STEM disciplines (science, technology, engineering and math), it is arguably more difficult to find within that pool graduates who also have the right "soft skills." Another colleague, David Leonhardi of Boeing, describes soft skills as the "killer app" for analytics. He quotes Daniel Pink's book A Whole New Mind on the importance of story: "Despite all of the information and data, an effective argument is not enough. You need to have the ability to fashion a compelling narrative to convince and communicate." SAS and the Analytics Section of INFORMS pose this very challenge in the 2013 Student Analytical Scholar Competition. Students are challenged to read a case study on Floor Mix Optimization at Lucky Duck Entertainment and then submit a Statement of Work (SOW) in response. The winner will be the one who constructs the most effective argument in a proposal that is also compelling and convincing.

While "hard math" is required to address the technical issues in this case study, successful applicants will also have to demonstrate their "soft skills." Convincing a client to hire you requires the technical skills to solve their problem, but it doesn't start there. The best process involves following the analytics life cycle, which begins by identifying and formulating the problem. This step alone is often a challenge, as problems are rarely delivered in a "ready to solve" format. It may be necessary to speak to several people from different backgrounds and parts of the organization with varying (even conflicting) priorities and perspectives. Getting agreement on problem formulation alone requires a combination of technical understanding, business acumen, and facilitation skills.

Writing a successful SOW requires understanding, defining, and proposing a solution to the problem, but it also requires conveying this information in a convincing way that can be understood by disparate stakeholders. Learning how to interact with those stakeholders is part of this competition, which offers students a chance to practice these skills. In a discussion board on the AllAnalytics.com site, starting today and running until February 19 at 5:00 p.m. eastern time, three of the main characters from the case study (from IT, operations, and sales) will be logging in to answer questions students may pose. Students may ask up to five questions each, and anyone is able to read the discussion and consider how they would approach the problem themselves. After all, didn't most of us learn those skills via OJT - on the job training?

I've just started reading Daniel Pink's latest book, To Sell Is Human: The Surprising Truth About Moving Others. He drew me in with his assertion that although statistically one in nine Americans works in sales, so do the other eight, because these days we are all called to convince others. Analytics professionals who thought they could leave sales to others will find that these soft skills of "selling," as Pink describes it, are required of them, too. The winning student whose SOW best "sells" their solution will have their expenses paid to attend the INFORMS Conference on Business Analytics and Operations Research, April 7-9 in San Antonio. At this conference INFORMS even offers a Soft Skills Workshop for those who want to brush up on these skills. Tune in to the discussion to see how today's students are learning and remember your own journey down this path.

Why people and process matter, in addition to great technology, in predictive analytics

The increasing use of predictive analytics in mission-critical business decisions and operations brings  new challenges to the forefront for many of our customers. Throughout the last year I spoke to many customers about their use of predictive analytics and where they see areas of improvement to achieve even more success with its application in their organizations. One common theme was that there are other factors besides superior technology that make the difference between success and failure for these kinds of projects. Our most frequently espoused point of view is that the value of advanced analytics is harder to achieve and sell within the organization without also seriously looking at the process and the people that need to be involved.

The predictive analytics lifecycle we have created at SAS has proven to be a process model that works well for these discussions. It is a cross-departmental and cross-functional end-to-end process view that is also industry- and vendor-agnostic. It includes all required stakeholders for success - business, IT and analysts - and provides a clear step-by-step approach to implementing, operationalizing and even automating predictive analytics. 

 
In our discussions with the teams involved in predictive analytics projects, we often notice that this process model supports a goal-oriented discussion about responsibilities, tasks and roles for the specific needs of the organization. Especially with the dawn of Big Data and Big Analytics, a clearly-structured process model is becoming not just a nice-to-have but more and more a critical success factor.

It has been fascinating to be part of discussions between business, IT and analysts and see the passion of the teams to make the use of predictive analytics a contributing factor to the success of the company. The lifecycle model has proven to be very helpful in defining their roles and responsibilities in their specific environment and identifying bottlenecks. Often, these bottlenecks relate to gaps in communication, hand-over definitions and data problems, rather than to problems with the technology, which was the perceived issue that triggered the customer to call us. As a result of the discussions and the mapping of the process model to the organizational structure, roadmaps can be developed to achieve a more productive and efficient use of predictive analytics. A side effect of this often is  higher visibility and credibility for  the teams involved, as they are seen by senior management as one unit working together with a common blueprint for success.

For me these projects have shown that by supporting our customers not only with great technology (SAS was just named the leader and an “analytics powerhouse” in The Forrester Wave™: Big Data Predictive Analytics Solutions) but also with the soft factors for successful predictive analytics, such as process and people, we can build stronger relationships and become a partner to them. Delivering the technology while enabling the people and the processes helps us help our customers. Please let us know if you want to discuss in more detail the implications of process and people on the success of predictive analytics in your organization.

 

Advancing and assessing analytics maturity: part 2

In part 1 of my thoughts about analytics maturity, I deferred talking about issues related to the actual assessment of your organization’s level. Today I intend to detail some of the ways my peers and I are thinking about analytical maturity, comment on scales in use today, and address some of the innovative ways we have worked to preserve aspects of organizational complexity. And I’ll reemphasize a tip - “Get started now!” Almost all decisions at a departmental level should rely on facts generated from analyses (including predictive modeling), and there are always things you can do to extend your progress.

First, let’s talk briefly about developmental stage labels. Looking at any analytically-oriented organizational maturity chart, it is immediately apparent that the levels are not defined using a uniform set of criteria. Usually classification systems shift definitional requirements when comparing lower levels to higher ones. Since SAS was the first to patent a five-tiered classification system (called the Information Evolution Model), I extend that model by proposing these five stages of development:

  • Immature (Level 1, the lowest)
  • Aware (Level 2)
  • Informed (Level 3)
  • Reliant (Level 4)
  • Innovative (Level 5)

Each level is based on the sophistication of analytics usage – pure and simple. These labels ensure that each stage is separate and distinct from the others, and that organizational analytic maturity is uniformly emphasized (apart from what is introduced in books by my SAS colleague Evan Stubbs on The Value of Business Analytics and Tom Davenport on Competing on Analytics).

The next important requirement of any classification system is to preserve the underlying variability that can be used to distinguish one organization from another, even if they are at the same maturity level. For instance, one business at an Informed level (Level 3) may have advanced tools, but no statisticians or programmers, while another may have these kinds of people but no analytics tools they can use. SAS has recently developed a set of corresponding metrics related to people, process and technology for each of the levels. My colleagues and I are using these metrics to score individual companies and even departments or divisions within larger companies. That way we can easily plot organizations on the larger scale as well as evaluate them against their peers. This scoring system supports the macro-level definitions while preserving the underlying variability that naturally exists from organization to organization.

Regardless of an organization’s level, the ensuing question is always “How do I move up to the next level?” The answer - there are always things you can do to advance your company, and the time to act is “Now!” Creating a fact-based culture must begin by infusing all business decisions with more informed or analytically-derived choices. Usually we recommend changes to the underlying analytics infrastructure, which includes the people, process, and technology distinctions alluded to earlier and in part 1 of this series on advancing and assessing analytics maturity. The focus needs to be on leveraging analytics resources from a variety of different infrastructure components. Therefore, if you do not have specific assets to leverage, then priority goes toward obtaining the necessary analytics infrastructure. People and technology acquisitions are usually the first things to be considered, but it is critical to evaluate important business processes and data flow (i.e. data consumption needs) throughout the organization. And finally, always actively engage upper management in leading and supporting any overhaul/upgrade efforts.

Once a variety of analytics assets are in place, the next step in leveraging those assets is educational and marketing efforts. Teaching others about the benefits and advantages of analytics should be on-going, but it begins with simple things like lunch-and-learn seminars, weekly/monthly newsletters, and conversations among team members about how to better incorporate analytics into daily decision-making. Undertaking a quick analytics assessment is usually the first step in prioritizing current infrastructure and improving data throughput, and is a qualified service that we offer here at SAS.  We can give you a robust set of recommendations that will at least get you started.

There is no one-size-fits-all approach to get you to the next level of analytics maturity. It may be necessary to leap-frog specific qualities and jump to a higher level of sophistication. But no matter where you are, there are always things that can be done to improve analytics usage and information consumption at your organization. By applying constant effort and recording notable successes, your analytics maturity is bound to increase over time. It is a marathon race that is really about endurance over the long haul! And SAS is here to help!

Analytics better than average - how McDonald's operationalizes analytics

I wanted to share some interesting highlights that I learned from SAS customer Mike Cramer, Director of Operations Research for Worldwide Restaurant Innovation at McDonald’s. Mike Cramer spent a day at SAS Headquarters in Cary, NC, in November 2012 to talk about McDonald’s and their experience and success driving innovation through the use of advanced analytics.

The Plan to Win

McDonald’s has a “Plan to Win,” which is their blueprint to achieve sustained, profitable growth by servicing more customers, more often, with more brand loyalty and more profitability. Like me, you might think that most McDonald’s restaurants are all the same - the same size, the same configurations, the same food, etc.  This is not true; in fact, to say that each store is unique in these combined factors is much closer to the truth. With that fact in mind, Mike shared with us his team’s challenges in identifying the factors in location, the physical design and the demand patterns that would ensure the biggest impact on revenue. They then went about changing behavior by reengineering the operating measurement systems to provide an accurate calculation of performance and metrics to report.

The Operations Research group took on the challenge of deploying scientific methods to address “The Plan to Win” on a restaurant-specific basis. Historically at McDonald’s, plans of this nature were formulated on a higher level, generalizing the restaurants within a market. However, generalizing meant that information processed at a store level and reported up to the executive leadership was based on average measurements: average throughput of burgers, average working staff, etc. The danger of this approach is that it often led to wrong conclusions and actions – a vicious cycle of average numbers.

How do you measure?

Mike walked us through a new approach of looking at each store individually, factoring in their unique physical design and operating conditions, then predicting what would and what should happen, based on alternative investment strategies and past growth trends. The innovation team used statistical processes, analyzing distributions and trends, and embedded the results in their measurement system. Furthermore, they diagnosed the behaviors at each store during certain time periods and provided a recommended action to be taken to unleash the capacity they currently have to capture the demand. For McDonald’s, channeling idle capacity to serve more customers and preventing customers from leaving due to perceived long lines represent significant revenue opportunities. It’s all about preventing the dreaded “driveoff.”

Watch this 5 minute video on “Analytics Better Than Average,” where Mike Cramer explains this phenomenon in more detail.

Operationalizing Analytics

McDonald’s is now sharing the results of their new analytics-based performance metrics internally with their franchisees and restaurant management, as well as leadership. Their format is called the Sales Opportunity Analysis Report (SOAR), which is a great example of operationalizing analytics and embedding results in decision making.

Will you be next?

I would like to extend a big thanks to Mike Cramer for spending a day with us and giving SAS insights into the McDonald’s way of applying analytics, so that I can in turn share them with you. Among our customers we find strong thought leaders with fresh ideas that we bring back to SAS R&D, and of course this then leads to a virtuous cycle of improved software for you. Better than average!

Do you have a story to tell? Leave a comment here or send a note if you wish to spend a day at SAS letting us learn from you.

To learn more about this topic, listen to James Taylor talk about Operationalizing Analytics.

Five Recommendations to help in recruiting analytical talent

As rain settles in over the green fields of England, I’ve been reading the Times Higher Education (THE) periodical. It’s always a lively read, as it invariably takes the part of untenured junior lecturers in any dispute. It is also very well researched and informed.

This week’s THE edition has the following data table: “% of European students taking up Master’s-level study soon after graduation”*:

>75-100% Ireland, France, Denmark, Poland, Czech Rep., Austria,   Lithuania
>50-75% Germany, Portugal, Switzerland, Italy, Finland, Rumania
>25-50% Scotland, Belgium, Sweden, Hungary, Estonia
>10-25% Spain, Iceland, Norway, Netherlands, Latvia, Greece,   Bulgaria
>0-10% England, Wales, Northern Ireland

As a Brit I view these results from the United Kingdom (UK) perspective, and they give pause for reflection. The table certainly reflects my own direct observations of the vanishingly small number of UK students registered on Master’s-level analytics courses. A high rate of continuation is not necessarily positive - in some higher education  systems the high progression is driven by a need for students to make up for perceived deficiencies in undergraduate provision or because of lack of employment opportunities.

For UK employers and SAS customers seeking graduates with advanced analytics skills, there are now several key factors to take into account when setting up and expanding their analytical capacity:

  1. The advent of Big Data analytics, and the resulting jobs boom (58,000 in the UK alone before 2017), means that the crunch for analytics skills is now upon us.
  2. The present lack of progression by UK students to Master’s level is a constraint on developing analytical teams, because undergraduate degrees alone do not confer sufficiently advanced analytics skills.
  3. The advent of the legislated higher undergraduate fees from September 2012 means that numbers of UK students progressing to Master’s level from September 2015 will probably decline even further, possibly precipitously.
  4. Analytics and advanced analytics courses at UK universities are flourishing, but this increasing enrollment is driven by non-local students from outside the European Union.
  5. The current UK Coalition Government’s restriction on work visas for talented non-EU graduates is compounding a worsening situation for builders of analytics teams.

In a previous blog on The Growing Shortage of Analytical Talent and Where to Find It, which generated some discussion, I opined that the best recruiting ground for advanced analytical skills is specialist MSc courses run within business school milieus. This remains true. However, in the light of the boom in business analytics, and the relative scarcity of such courses even in the UK, it seems that new graduate recruitment strategies are urgently needed. For SAS customers in the UK the way forward is a combination of the following recommendations:

  • For the longer term: start offering generous bursaries and provisional job offers to encourage bright undergraduates to progress to analytics Master’s studies.
  • Establish partnerships with the SAS United Kingdom Academic Programme-sponsored analytics MSc programmes at leading UK and Ireland universities.
  • Offer summer placements, or placement years, to students on advanced analytics courses, and treat the placements as extended interviews.
  • In the short term: widen the graduate recruitment net to universities in European Union countries with robust higher education systems. The local SAS Academic Programmes can assist with making these contacts.
  • Work with local Chambers of Industry to lobby the Government for a more enlightened visa regime so more foreign-born students interested in these fields have the opportunity to stay and work.

A further thought…

The elephant in the room is the value, or lack of it, placed on higher education within our society. Families of students from South and East Asia are mortgaging themselves for up to three generations in order to send loved ones to university in the UK. The conclusions to be drawn from this in terms of work ethic, immigration and talent acquisition could be the subject for many future blogs. Admittedly, this is a UK-centric view; it would be interesting to hear perspectives from other countries on the challenges of developing, finding and retaining analytical talent. How do you see things where you live?

* ”Postgraduate Education: An Independent Inquiry by the Higher Education Commission”, London, 23Oct2012.

 

Assessing and Advancing Analytics Maturity: Part 1

Analytics maturity is a hot topic right now.  Many come to SAS for answers on how to assess their analytics maturity and advance their use of analytics, especially at a corporate level.  I want to share the highlights of what we usually prescribe from a best practices perspective regarding advancing analytics maturity.  We often give customers these four tips:

Good analytics always begins with quality, pre-staged data

Data used for predictive modeling is typically not like other kinds of enterprise data, particularly data consisting of purely operational or transactional records. Instead, advanced analytics almost always requires transactional observations to be pre-summarized and/or transposed into what is called a “one-row-per-subject” analytic base table or data mart. This is a crucial step in analytical work and is a topic finally getting more focus and attention. An example of a one-row-per-subject table would be a unique list of customers with their past two years of monthly purchases aggregated into 24 columns, with the subject being the customer in this case.
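The customer example above can be sketched in Python as follows. This is a minimal illustration, not a description of any SAS tool: the transaction layout (customer id, month index from 0 to 23, amount) and the function name are assumptions of mine, and months without purchases are filled with 0.0, which is itself a decision about missing data.

```python
from collections import defaultdict

N_MONTHS = 24  # two years of monthly columns, as in the example


def build_analytic_base_table(transactions, n_months=N_MONTHS):
    """Aggregate (customer_id, month_index, amount) rows into one row per customer."""
    table = defaultdict(lambda: [0.0] * n_months)
    for customer_id, month_index, amount in transactions:
        if 0 <= month_index < n_months:  # ignore purchases outside the window
            table[customer_id][month_index] += amount
    return dict(table)


# Hypothetical transactional records, for illustration only.
transactions = [
    ("C001", 0, 19.99),
    ("C001", 0, 5.00),
    ("C001", 3, 42.50),
    ("C002", 23, 10.00),
    ("C002", 30, 99.00),  # outside the 24-month window, dropped
]

abt = build_analytic_base_table(transactions)
for customer_id, monthly_purchases in abt.items():
    print(customer_id, monthly_purchases)
```

Each customer now appears exactly once, with 24 columns of monthly purchase totals that a modeling procedure can consume directly.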

Furthermore, the maxim “garbage in, garbage out” is inherently true for all analytic modeling efforts. In fact, there are a lot of modifications and quality checks that usually need to be done to data in order to refine it for consumption by specific statistical techniques. One such modification might be to add new columns to the data that consist of external or explanatory variables. In practice these additions could consist of demographic or macro-economic variables not typically found inside traditional enterprise data warehouses. Another type of data modification is adding what are called ‘derived fields’, namely columns that are created by transforming existing inputs and information. Changes like these are usually necessary to obtain the most value from a modeling data set and comprise a process commonly called ‘analytics data-staging’.
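To make ‘derived fields’ concrete, here is a small Python sketch that computes a few of them from a row of 24 monthly purchase amounts. The field names (total_spend, recent_3m_spend, active_months, recent_share) are invented for illustration; which derived fields are useful depends on the modeling task.

```python
def add_derived_fields(monthly_purchases):
    """Derive summary columns from a list of monthly purchase amounts (oldest first)."""
    total = sum(monthly_purchases)
    recent = sum(monthly_purchases[-3:])  # the last three months
    active_months = sum(1 for amount in monthly_purchases if amount > 0)
    return {
        "total_spend": total,
        "recent_3m_spend": recent,
        "active_months": active_months,
        # share of total spend that falls into the last three months
        "recent_share": recent / total if total else 0.0,
    }


# Hypothetical customer who only started buying three months ago.
row = [0.0] * 21 + [10.0, 20.0, 30.0]
derived = add_derived_fields(row)
print(derived)
```

Fields like these often carry more signal for a model than the raw monthly columns they were built from.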

Improving analytics maturity involves people, process, and technology considerations and re-engineering, not just having better analytic modeling capabilities

Increasing analytics maturity at a departmental level must include a holistic view of infrastructure components that both directly and indirectly impact the success of predictive modeling efforts.  Software and hardware capabilities are major contributors for how an organization is able to conduct analytics, but other factors need to be examined as well.  For instance, outmoded business processes need to be identified, evaluated for their effectiveness, and replaced if necessary.  Another key component of analytics maturity is how well the staff has been trained to incorporate analytics into their daily decision making.  Sometimes it means hiring new people, but other times it means educating the staff on new choices they need to make based on factual information and statistical predictions.

There are always things that can be done to improve analytics usage, regardless of where you are maturity-wise

Regardless of how you assess yourself, your business unit, or your division, there are concrete things you can do to improve your analytics maturity.  Perhaps within your company predictive analytics are largely limited to production applications and the full benefits of analytics are not understood by a majority of your colleagues.  You can begin improving your tactical capabilities by helping to prioritize future analytical projects and showcasing successful outcomes.  Publicizing success stories and informing people where they can go for help is also a great way to get upper management interested.  You might even be able to start a monthly user group to discuss analytics usage within your company.  These sorts of activities will ultimately propel you toward being a more sophisticated analytics worker and your company toward being more analytics-driven.

Automating data preparation is fundamental to making analytics professionals more productive

For some higher levels of analytics functioning, and once analytics professionals have been identified as having common data consumption needs, efforts need to be made to automate time-consuming data preparation steps that are precursors of more advanced modeling efforts.   We estimate that a full 80% of model development effort is spent shaping and preparing the data for predictive analysis.  The implication is that the IT department and a variety of business units need to be brought together in a true collaborative environment in order to reduce redundant efforts and make data preparation tasks more efficient.  IT professionals are usually unaware of how to best support advanced analytics needs, and an educational effort has to be the focus of getting groups of people with different goals and agendas to start talking with one another.  At SAS we offer assessment workshops to help with this collaboration, and there is a nice article written by Dr. Anne Robinson, Director of Analytics at Verizon Wireless, where she shares five tips for bridging the business-IT gap.

Next time I will lay out some distinctions that can be used to characterize different levels of analytics maturity, as well as discuss why assigning labels to one group or to a company can be a difficult and sometimes counter-productive endeavor.  Please stay tuned!

Meanwhile, what have you found to be best practices?

How to Hire a Chief Data Scientist

How do you hire a Chief Data Scientist? That's not a hypothetical question: I know of at least three companies that are actively looking for a "Chief Data Scientist" at the moment. Hiring the right person is harder than you'd think.

Whether or not a Chief Data Scientist is a necessary role for every organisation out there is an interesting question but arguably irrelevant - in terms of revealed preference, it's enough that they're recruiting.

Hiring a Chief Data Scientist opens up a whole nest of problems. The obvious one is structural: there are just not enough people to go around. For those who are competitive and looking for a challenge, it really is a world market at the moment. Since looking locally rarely yields enough candidates given how shallow the pool is, it almost inevitably ends up being a global search as well.

There is a subtler issue: defining the role is hard! After all, it's not like there's a library of position descriptions out there. And, there's surprisingly little consensus about what the job should involve - ask five people what they think makes a good data scientist and you'll probably get seven answers!

The obvious approach is to assume that it's like a data scientist, just smarter. This would suggest one (or more) PhDs in technical areas, deep experience in "bare metal" algorithm development (in MapReduce, C, or any other often low-level language), and a demonstrated history of applied mathematics in a business or commercial context.

I think this is a mistake. I'm not alone, either: management is always different from application and analytics is no different. Information technology and marketing moved through similar transitions as they gravitated away from finance and sales.

Today's Chief Information Officer and Chief Marketing Officer need more than just technical skills. Anyone who suggests with a straight face that the main measure of an effective Fortune 100 CIO is their ability to cut code will probably get laughed out of the building. The same goes for the CMO – the ability to develop creative content is important but it pales in comparison to driving marketing return on investment.

From where I sit, the ideal Chief Data Scientist blends three different competencies:

  • Technical skills
  • Value creating skills
  • Change management skills

Technical skills are an essential part of the job. It's hard to be credible in the field if one can't spell heteroscedasticity off the top of one's head. However, breadth of knowledge is arguably more important than depth in this space.

Depth is great when every problem is the same. While there are a wide variety of problems that benefit from common competencies, the set of problems that do not is larger. Besides, it's relatively easy (and cheaper) to hire smart people fresh from university or consultancies. Why pay top dollar at the C-level when there's a horde of PhD graduates who would leap at a (proportionally tiny) offer 10% higher than market rates?

Arguably more important than technical knowledge is the ability to channel that knowledge into value-creating initiatives. Getting past insight is a hard lesson to learn: it may be a great model but if it doesn't add anything to the bottom line or to social outcomes, no-one's going to care. This awareness isn't cheap, either. Anyone who's deployed operational analytics and analytically-driven microdecisions has more than a few battle-scars and stories to talk about.

Finally, and most importantly, the job is about leading and managing change. The heart of business analytics isn't maths. It's about getting everyone to behave differently based on data-driven insight.

Analytics helps identify how things could be better. Business analytics is about convincing everyone that it's worth doing things differently. Change management and persuasion are the two most important skills in the field and yet are frequently paid the least attention. An effective Chief Data Scientist needs to live them, heart and soul; if they don't, who will?

So how do you hire the ideal Chief Data Scientist? Here are my tips:

  • Look for people with a demonstrated history of driving change. They're going to need to convince the organisation to behave differently, something that's notoriously hard.
  • Grill them on their technical chops. Or, even better, get your most technical analytical people to do it for you. It's a technical discipline and there are more than a few people out there happy to take advantage of information asymmetries if it means they'll get a nice title and a fat pay check.
  • Communication is an essential part of the job. So is innovation. Use social media to look for people with a broad and diverse network and then test them on their implied relationships. On one hand, it's hard to build a real network without some social mores. On the other, innovation is easier when it draws from fresh exposure to fresh ideas.
  • Make sure they understand what you're looking for and that you know what they're looking for. If you're after a change agent, getting someone who just wants to patent new ideas may end up in buyer's remorse.

What do you suggest?

Divide and Conquer: Segmenting Time Series for Improved Forecasts

We all have some sort of intuitive idea of what time series data is – it’s a bunch of measurements or observations that are marked by a time stamp – we know when the measurement was taken, as well as what was measured. This natural temporal ordering of the data is a vital component for statistical forecasting: we are searching for patterns that change systematically over time. In order to do a good job in forecasting we need to know about the structure of our data.

Ideally, the time series would look like this: well behaved, few missing values, ample history, a nice repeating pattern – this is the sort of data that can be fed to an automated forecasting environment such as SAS Forecast Server, and good results obtained. But life – and forecasting – isn’t always that neat. In almost every forecasting project I have been involved with we were faced with all kinds of not-so-well-behaved series: slow moving items (intermittent demand), new items (no or very short amount of history), end-of-life-cycle items (demand drops to 0), fashion items (short history), items which are only sold during a particular time during the year (think Halloween costumes). Is it reasonable to assume that a one-size-fits-all approach will come up with good forecasts for all these different series? I don’t think so.

You have probably heard the old saying that 80% of the time for any analytical task is spent on data preparation. I would argue that all decisions about how to structure and segment your time series data are as important as the forecasting methods themselves in creating the best possible forecast. Note that this goes beyond traditional tasks like missing value replacement, transformations and outlier detection. Each type of demand pattern described above will require a different model strategy, so it makes a lot of sense to separate the data into distinct types or segments. For example:

  • Complete series – apply robust time series forecasting methods such as ARIMA models.
  • Sparse data – apply intermittent demand forecasting methods such as Croston’s method, or treat the question as an inventory policy problem (instead of asking how much demand we are going to have, calculate the appropriate number of items to buffer against the risk of running out of stock).
  • Short data – forecast using similarity analysis techniques.
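A rough Python sketch of this kind of segmentation follows. The thresholds in `classify_series` (24 periods of history, more than half of the periods at zero) are illustrative assumptions, not rules taken from SAS Forecast Server, and `croston_forecast` is a plain textbook version of Croston’s method rather than any product’s implementation.

```python
def classify_series(values, min_history=24, max_zero_share=0.5):
    """Assign a demand series to a segment (thresholds are assumptions)."""
    if len(values) < min_history:
        return "short"
    zero_share = sum(1 for v in values if v == 0) / len(values)
    if zero_share > max_zero_share:
        return "sparse"
    return "complete"


def croston_forecast(values, alpha=0.1):
    """Croston's method: smooth the non-zero demand sizes and the intervals
    between them separately; the per-period forecast is size / interval."""
    size = None
    interval = None
    periods_since_demand = 0
    for demand in values:
        periods_since_demand += 1
        if demand > 0:
            if size is None:
                # Initialise from the first observed demand.
                size = float(demand)
                interval = float(periods_since_demand)
            else:
                # Update only in periods where demand actually occurred.
                size = alpha * demand + (1 - alpha) * size
                interval = alpha * periods_since_demand + (1 - alpha) * interval
            periods_since_demand = 0
    if size is None:
        return 0.0  # no demand was ever observed
    return size / interval


intermittent = [0, 0, 4, 0, 0, 0, 6]
print(classify_series(intermittent))            # too little history: "short"
print(croston_forecast(intermittent, alpha=0.5))
```

In practice each segment label would route the series to its own project or modeling strategy; the classifier here only shows where that routing decision could be made.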

Other, more complex strategies can be applied: segment by seasonality pattern, segment by contribution to revenue, or segment by “forecastability”. For each of the segments different SAS Forecast Server projects can be created and different modeling strategies can be applied. To support the forecasting analyst with this task we have released SAS Time Series Studio – a new Graphical User Interface in SAS Forecast Studio 12.1.

SAS Time Series Studio is designed to:

  • Explore time series data interactively
  • Explore multiple time series simultaneously
  • Explore impact of setting up different hierarchies
  • Explore impact of different time intervals
  • Segment data based on findings

My colleague Meredith John and I had the pleasure of introducing SAS Time Series Studio at A2012 in Las Vegas. In our presentation we discussed the importance of understanding the structure of time series data for generating forecasts. I encourage users of SAS Forecast Server (and those who will be) to check out SAS Time Series Studio themselves and share their experiences with us. In fact, you might tweet your comments using #sastss.