How to build a data science dream team

As a data scientist, I have the rare privilege of possessing the job title that Tom Davenport and others have dubbed the sexiest job of the 21st century. As this popular job title catches on, I’ve even noticed a trend where customers make direct requests for help specifically from “the data scientist.”

Friends who don’t understand what I do often ask if I wear a lab coat to work. Of course, I don’t, but at times I have considered it! My typical day involves helping colleagues and customers find solutions to their most critical business needs through the power of analytics.

Though the title has gained rock star status in some circles, the truth is, data science and analytics cannot be successfully executed by data scientists alone. We are only one piece of the puzzle. Success in data science and analytics requires an entire team of employees in diverse roles. The data scientist would inevitably fail in solving critical business issues without a full team engaged in the analytics process.

Read More »

Post a Comment

How to make your data available, interconnected and usable

It's a common problem: Your organization collects vast amounts and types of data – but it's spread across different departments and locations in various formats and systems. It’s a massive challenge to make all that data available, connected and usable to everyone who needs it.

Empowering Enterprise Decision Making

That was the scene at North Carolina State University. But Marc Hoit, Vice Chancellor for Information Technology and CIO, came up with a solution that creates a holistic view of diverse university data spanning departments, campuses, functions and activities.

Their existing systems had built-in reporting, but those reports were confined to individual data silos. For example, some departments had custom reports built solely for their own use. Hoit’s group wanted to link all the data and put it into usable formats for analysis and reporting – potentially transforming enterprise decision making across the university.

To do this, they needed to reduce the number of reports, make the remaining ones more flexible, and help users become self-sufficient. After all, being able to access and analyze all that data – even if it seems unrelated – has big implications for students, educators and even entire educational systems.

To learn more about what Marc Hoit was able to accomplish at North Carolina State University, read this paper: Empowering Enterprise Decision Making: How North Carolina State University Makes Data Available, Interconnected and Usable. He shares some great insights, advice and lessons learned.

Post a Comment

Get the inside track on the UK's General Election result

Bookies have long plied a trade in predicting the fate of our politicians in the general election. According to Ladbrokes, gamblers are set to spend a staggering £100m betting on this year’s result.

The outcome of the May 7 vote is anticipated to be the hardest election to predict in recent memory. For the first time, it’s conceivable that the joint vote share of the two main parties might fall under 60 percent.



In 2012, Nate Silver, author of the exceptional data blog FiveThirtyEight, famously used the same analytic models he applied to sports betting to predict the US presidential election result. Crucially, each contest had a relative likelihood of success, and aggregating those probabilities across the whole election produced a remarkably accurate result.

However, the UK political scene has become a little more complex. No longer a simple red vs. blue contest, parties historically considered to be on the fringes have taken the fight to the incumbents. The Liberal Democrats (although in fairness an incumbent themselves), Scottish National Party (SNP), United Kingdom Independence Party (UKIP), Green and Plaid Cymru (Party of Wales) have divided voters and complicated our forecasting models.

The UK system dictates that the party with the most seats wins, but each of the 650 seats has to be fought for one by one. The importance of a majority in each location or constituency adds another layer of intricacy to our calculation.

Historically we’ve been able to predict seat outcomes by factoring in the change in the opinion polls since the last election. For example, if a party won the seat with 40 percent of the votes, and its opinion poll rating has since dropped by 10 percent, you could reduce that 40 percent by 10 percent, giving 36 percent, meaning it may lose the seat. However, now that we’re looking at a six-party race split across 650 seats, a more intricate model is required. In a bid to show what Parliament would look like based on the latest polls, The Guardian has produced an interesting projection methodology.
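
To illustrate, the swing arithmetic described above can be sketched in a few lines of code. The party names and figures below are hypothetical, not real polling numbers:

```python
# Sketch of a simple uniform-swing seat projection (illustrative only).

def project_share(last_share, poll_change_pct):
    """Scale a seat's previous vote share by a party's national poll swing."""
    return last_share * (1 + poll_change_pct / 100.0)

# A hypothetical constituency: vote shares at the last election (%)
last_result = {"Party A": 40.0, "Party B": 35.0, "Party C": 25.0}

# Hypothetical change in each party's national opinion poll rating (%)
poll_swing = {"Party A": -10.0, "Party B": 5.0, "Party C": 2.0}

projected = {p: project_share(s, poll_swing[p]) for p, s in last_result.items()}
winner = max(projected, key=projected.get)

print(projected)  # Party A drops from 40.0 to 36.0 and loses the seat
print(winner)
```

With six parties and 650 constituencies, this single calculation would run thousands of times, which is why a more intricate model quickly becomes necessary.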

Whilst the model has become more complex, the good news is that there are a number of data points we can add into our calculations. Opinion polls are the most traditional source of up-to-date information on which way the public is leaning. However, polls can occasionally mislead us, as they did in 1992, when the final polls predicted a 1.4 percent Labour lead but the Conservatives won by 7.6 percent. Betting markets are often touted as a reliable source, with Professor Leighton Vaughan Williams, director of the Political Forecasting Unit at Nottingham Business School, claiming they are more accurate than the polls.

The emergence and mining of social media data can track party and voter sentiment. For the first time in the UK, apps are available that enable the general public to follow the trends and gain insight into the mood around the main parties. However, social media tends to be a fairly biased sample and can mislead. When it came to the 2011 referendum on the Alternative Vote system, social media suggested a big win for AV, whereas in fact the status quo won out. To make meaningful predictions about the result in constituencies, or indeed nationally, you need the capability to analyse a much wider pool of data.

Search engine data produced remarkably accurate results for the referendum on Scottish independence. However, given certain party leaders’ penchant for headline-grabbing statements, search volumes could be more an indicator of celebrity than of potential success.

In the next blog, I’ll look at how SAS can access open source map data to begin to translate sentiment into seats. But in the meantime, if you’re keen to find out more about data science and the government, check out our research with Civil Service World on Big Data in the public sector.

Post a Comment

Oilfield contextual analytics: Do you want to play a game?

In the Cold War techno-thriller WarGames, a marine monitoring a nuclear missile silo deep under the Nevada desert sees a red warning light blink on his console. “Just flick it with your finger,” his colleague tells him. He does, and the bulb goes out. Problem solved.

But what will their supervisor, looking at a report later, make of this brief alert? Without “situational awareness” of the event – handled with a quick verbal exchange in the room – the manager would be left to infer whether America was at risk of a nuclear incident or just had a bad solder joint. The risk that he could misinterpret the data, out of context, is no small matter.

This lack of context around data translates easily from Hollywood to oilfields. As Big Data has grown bigger, industrial operations are compiling and analyzing enormous volumes of information, usually in real time and often remotely. Much of the human interaction that gives meaning to data fails to accompany those metrics as they are compiled and analyzed.

I recently asked my colleague Moray Laing, an oil and gas expert here at SAS, to help me connect the dots. He told me that, without context, the quest to attain “a single source of the truth” from data will remain unfulfilled. “That’s the missing element. Although activity is partially handled in the existing data stream, situation-based information tends to be handled by the human being – not just what am I doing, but what is the intent of what I’m trying to achieve?”

Laing, a former long-time Baker Hughes engineer, reminded me that the Big Data dynamic has grown from the original three V’s – volume, velocity, variety – to seven, including veracity. The term is a nod to the impact that imprecise or uncertain data can have on accurate analysis, especially when data is flowing quickly and in large volumes. Contextual analytics represents a new frontier for companies attempting to capture this new element of big data and turn it into useful information. “Knowledge management is a key piece of the oil and gas industry,” Laing says. “We have tons of data both in stream and stored at the enterprise, but how do I unlock the knowledge? I think it comes back to that fourth V, veracity. To have the data enriched by context is one of the best ways to do that.”

Examples of improving field operations with contextual analytics abound. Here’s one: Drilling is a complex operation fraught with hazards, where multiple objectives are constantly being balanced. During an influx event, a remote operations team, or even an embedded control algorithm, needs both situation and intent to be able to provide not just the right advice, but the appropriate advice. The situation is “the driller is trying to control an influx”; the enrichment comes from the intent: “the driller is using a two-circulation procedure.”
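
As a rough sketch of what enriching the data stream might look like, the event below carries its situation and intent alongside the raw reading. The field names and values here are hypothetical, not an actual industry schema:

```python
# Illustrative sketch: annotating a raw telemetry event with situational context.
from dataclasses import dataclass, asdict

@dataclass
class DrillingEvent:
    timestamp: str
    sensor: str
    reading: float
    situation: str = "unknown"   # what is happening right now
    intent: str = "unknown"      # what the operator is trying to achieve

# A raw reading arrives from the rig with no context attached
event = DrillingEvent("2015-04-01T10:32:00Z", "standpipe_pressure", 412.7)

# A remote operations team (or control algorithm) enriches the stream:
event.situation = "driller is controlling an influx"
event.intent = "two-circulation kill procedure"

print(asdict(event))
```

Storing the situation and intent in the stream itself, rather than in a verbal exchange in the control room, is what lets downstream analytics interpret the reading correctly.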

Laing tells me that the lack of consistent standards for communicating the context of a situation in the field is one of the biggest gaps in oil and gas companies’ ability to capture the veracity of big data. Once context is captured, calculated, categorized and stored in a consistent fashion as part of the data stream coming off an oil well operation, it opens the door to a level of automation that can reduce risk and cut the cost of drilling on those wells. Contextual analytics takes us one step closer to an intelligent drilling rig that reduces cost as well as risk.

Contextual analytics could also tell that missile silo manager in Nevada: Call an electrician.

Post a Comment

Big data research explains spicy curry and a thrilling novel


All scientific breakthroughs require some sort of experiment or evidence gathering to prove a hypothesis. Sometimes these breakthroughs are unrelated to the original hypothesis and are made by accident: as long as there’s some form of information to analyse, there’s scope for discovery. With so much of an ordinary person’s life now open to analysis through the data they leave behind, we are beginning to make breakthroughs that explain everyday life experiences.

Recently a piece in The Hindu reported a study that used data analytics techniques to establish an unusual feature of Indian cuisine. It found that, whereas most other global cuisines rely on positive food pairing – the pairing of similarly flavoured ingredients – Indian cuisine instead relies on negative food pairing, using dissimilarly flavoured ingredients. The researchers also discovered, by shuffling ingredients in a recipe and observing the effect on negative food pairing, that it was the spices that drove the negative pairing. Of the top 10 ingredients whose presence biased the flavour-sharing pattern of Indian cuisine towards negative pairing, nine were spices.
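
The food-pairing measure can be illustrated with a toy calculation: score a recipe by the average number of flavour compounds shared between ingredient pairs. The compound sets below are invented for illustration, not taken from the study:

```python
# Toy food-pairing score: average flavour compounds shared per ingredient pair.
from itertools import combinations

# Hypothetical flavour-compound sets (real databases list hundreds per ingredient)
flavor_compounds = {
    "tomato":    {"c1", "c2", "c3"},
    "basil":     {"c2", "c3", "c4"},
    "turmeric":  {"c7"},
    "coriander": {"c8", "c9"},
}

def mean_shared_compounds(recipe):
    """Higher scores indicate positive pairing; near-zero, negative pairing."""
    pairs = list(combinations(recipe, 2))
    shared = [len(flavor_compounds[a] & flavor_compounds[b]) for a, b in pairs]
    return sum(shared) / len(pairs)

print(mean_shared_compounds(["tomato", "basil"]))        # positive pairing: 2.0
print(mean_shared_compounds(["turmeric", "coriander"]))  # negative pairing: 0.0
```

Shuffling ingredients in and out of a recipe and re-running this score is, in miniature, the technique the article describes for isolating which ingredients drive the pairing pattern.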

Indian cuisine is also much more complex: a single dish may use 20 ingredients, compared to, say, five for a typical Western dish. When you consider this, along with the variation in Indian cuisine across regions and groups, there are countless possible combinations of ingredients, each of which may get a different reaction depending on who is tasting it. It demonstrates how Indian recipes could be personalised according to the spice combinations a person prefers.

Novel research

So where do novels come in? It so happens that other research, published around the same time, revealed there are six basic plots to any novel. As reported in the UK’s The Times newspaper, Matthew Jockers, a professor of English at Stanford University, discovered this after quantitative analysis of more than 40,000 novels. Plots follow six different patterns and can be represented graphically by plotting ‘Good Fortune’ on the Y-axis against time. The six basic curves fall into two broad types: ‘Man in Hole’, in which a man gets into trouble and gets out of it, and ‘Man on Hill’, in which the main emotional content is positive. Novels are split roughly equally between these two types, but with distinct groups within each. ‘Moby Dick’, for example, is a variant of the ‘Man in Hole’ plot, but with a longer period in the hole than most other novels.
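
The ‘Good Fortune’ curve idea can be sketched with a toy sentiment lexicon, scoring each segment of a novel in reading order. The lexicon and text below are made up for illustration; the actual research used far more sophisticated methods:

```python
# Toy "good fortune" curve: net sentiment per segment of a novel.

positive = {"joy", "love", "hope", "win"}
negative = {"loss", "fear", "trouble", "storm"}

def fortune_curve(segments):
    """Return net sentiment (positive minus negative words) per segment."""
    curve = []
    for seg in segments:
        words = seg.lower().split()
        score = sum(w in positive for w in words) - sum(w in negative for w in words)
        curve.append(score)
    return curve

# A toy "Man in Hole" shape: good fortune, then trouble, then recovery
novel = ["joy and hope", "storm and loss and trouble", "love and a win"]
print(fortune_curve(novel))  # [2, -3, 2]
```

Clustering thousands of such curves by shape is, in spirit, how a corpus of 40,000 novels can be reduced to a handful of basic plots.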

So again, this indicates more information could be used to better target offers to consumers. Instead of fairly crude suggestions – recommending only crime novels because the reader last ordered a crime novel – offers could also draw on the plot types the individual prefers, regardless of subject matter.

These examples highlight how it’s possible to draw on many different types of data, not just the traditional ones such as age, sex, disposable income and buying history, to establish the tastes and preferences of a particular consumer. This big data can be invaluable to retailers wanting to deliver the personalised shopping experience needed to steal a march on the competition.

The good news is, the technology already exists to collate and analyse all this data in a matter of seconds or minutes, giving retailers the insights they need to offer the right product at the right price at the right time via the right channel.

Find out more about why personalisation is critical for today’s marketers.

Post a Comment

The tale of two innovation labs for big data

Recently I have been out speaking with a number of organizations about the innovation lab concept, which I discussed in a previous blog post, as the way to unleash the power of big data and make even the largest of companies as agile as a startup. During these discussions I have observed a couple of things that I wanted to share with you, since it seems there are different types of innovation labs in organizations:

  1. Many companies have something I am going to label an IT innovation lab, where they are experimenting with "big data" technologies. These IT innovation labs are NOT the same as the “data related” innovation lab that companies need to put in place to remain agile in this new digital world. The focus of the IT innovation lab is to test the technology and its integration, whereas the focus of the "data related" innovation lab is to test hypotheses around mashups of data and different analytical approaches. In the digital world, information gleaned from data is your best competitive weapon, and speed is a critical component of your success. It is my opinion these should be separate and distinct in a company's strategy, as each has a role to play. This post will focus on the tale of two labs and how they differ.
  2. Tight budgets, and the significantly different focus of the “data focused” innovation lab, are causing organizations to ask for support in obtaining funding, since requesting an investment without a concrete business problem to solve is generally a new concept. A second post in this series will focus on how I suggest organizations build the business case for the data focused innovation lab, which I believe is vital to the future success of all organizations, no matter how large or small.

Read More »

Post a Comment

Six analytics lessons from the automotive roundtable

Increasingly, automotive executives want to talk about the "Art of the Possible" in analytics. So we took the opportunity to invite leaders from around the industry to an Automotive Analytics Executive Roundtable to share their stories and spark new ideas. A diverse set of speakers covered topics including big data/Hadoop, the connected vehicle, customer experience, building a culture of analytics, and innovative retail analytics.

To open the day, we discussed how analytics can address key issues facing the automotive industry. One topic was the use of analytics to address changing consumer access and the rise of consumer voice and influence through online channels, which enable better insight into product design and quality. With global manufacturing supply chains, the challenges and opportunities to adequately plan, forecast and optimize inventory and pricing are numerous. As we looked at "connected everything," we explored use cases for the connected vehicle, factory, dealer and consumer. None of these is possible without big data analytics. Keep reading for my top six lessons from the day that I hope you can use too: Read More »

Post a Comment

What can we learn from the world's largest repository of international trade data?

Did you know the US-Malawi trade relationship is based almost entirely on tobacco? Or that apparel & accessories drive Morocco's and Tunisia's exports while their North African neighbors rely on oil?

Trade balance report showing significance of tobacco to US-Malawi trade (click to enlarge)

The United Nations Statistics Division collects detailed international trade data from more than 200 countries on the import and export of hundreds of commodities. Talk about Big Data... Until now, the UN Comtrade database has been too large for most people or organizations to consume and analyze in its entirety. Thanks to a new collaboration with SAS, the data is now available to everyone via a web browser or mobile tablet, unlocking valuable information that benefits policy makers, businesses, researchers and the general public. A few of the many possible use cases:

  • A trade minister can interrogate decades of international trade data on a mobile tablet.
  • A global business can access immediate insights on risks and business performance in local markets.
  • A university student can gain valuable experience mining millions of rows of data for hidden trends and stories.

SAS Visual Analytics for UN Comtrade brings Big Data to the masses, enabling anyone to glean insights from the most comprehensive collection of international trade data in the world...300+ million rows.  Employing high performance data visualization, big data and cloud computing technologies, this online service exposes stories hidden across hundreds of trading partners and thousands of commodities to reveal how nations have interacted economically through the last three decades.

What stories can you uncover? How much oil does your country import? How much oil do your trade partners say they export to you? Is there a disparity? The mirror statistics raise plenty of questions.
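
A toy version of that mirror-statistics check might look like the sketch below; the country, commodity and figures are invented, not real Comtrade data:

```python
# Hedged sketch of a "mirror statistics" check: compare what a country reports
# importing with what its partners report exporting to it.

reported_imports = {("UK", "oil"): 100.0}   # importer's own figure (bn USD)
partner_exports  = {("UK", "oil"): 120.0}   # sum of partners' reported exports

def mirror_gap(key):
    """Positive gap: partners report exporting more than the importer records."""
    return partner_exports.get(key, 0.0) - reported_imports.get(key, 0.0)

for key in reported_imports:
    print(key, mirror_gap(key))  # a 20.0 bn discrepancy worth investigating
```

Discrepancies like this arise routinely from valuation differences, re-exports and reporting lags, which is exactly why mirror statistics raise so many questions.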

Users can explore the data through several views, including:

  • Imports/exports: Top importers and exporters by world, region and country, and what commodities they trade.
  • Trade balance: Top commodities bought and sold on a global or country level.
  • Trade composition: Most frequently traded commodities by partners.
  • Mirror statistics: A comparison of import/export data, as reported by partners on both sides of a trade relationship.
  • Trade history: Top trading partners for any given country, with an animated bubble plot tracking relationships over time.
  • Data: All data, presented in tabular format with powerful filters.
  • Historical analysis: Trade history across any combination of partners, commodities and years.

Users can drill into data on the top importers and commodities (click to enlarge)

SAS offers many of these unparalleled interactive visualizations for free at the link above. It is our contribution to advancing the global data revolution. We’re not just making the data public, we’re making the insights public – for the good of society.

In my next blog post, I will share lessons learned from this journey with the UN to help with their Big Data visualization initiatives.

What can you learn from the data? What surprised you? Please share your insights!

Post a Comment

The threat from within: battling internal fraud

Fraud is a growing problem for businesses – and one of the biggest threats comes from an organisation’s own employees. In many countries, the incidence of internal fraud is rising. According to the Credit Industry Fraud Avoidance System (CIFAS), in the UK alone there was an 18 percent rise in the total number of staff frauds recorded in 2013 compared to 2012.

It is a problem that differs from territory to territory. PwC’s 2014 Global Economic Crime Survey revealed that South African organisations suffer “significantly more procurement fraud, human resources fraud, bribery and financial statement fraud than organisations globally.” Equally, according to CIFAS’s Employee Fraudscape report, published in April 2014, the number of unsuccessful employment application frauds in the UK increased by over 70 percent in 2013 compared with 2012.

The problem is becoming a priority for many organisations - but the main area of focus differs from country to country. The spectre of financial loss is critical everywhere - but in many places it’s outweighed by the fear of reputational damage. In the UK and the US, where we have recently seen multiple market abuse and unauthorised trading cases hitting the headlines, there is a strong emphasis on addressing regulatory requirements.

Read More »

Post a Comment

When the going gets tough, the tough use analytics

When do analytics really provide value? All the time, of course. However, one of the best times for analytics to prove their value is when you are asked to do more with less. Often, the reason we are asked to do more with less is an economic downturn for our company, industry, or the economy as a whole. It is ironic that investing in analytics, which could have a meaningful impact on how well our organization functions long-term, is sometimes only considered when times are good and resources are plentiful, not when times are tough. However, history shows us time and again that analytics can help you the most by allowing you to do more with less.

Faster analytics provide even more value

High-performance analytics makes it even easier to do more with less (especially in less time). As you know, time is a limited resource that we just can't get back, and for many decisions, timing is crucial. If it takes too long to get the analytic insight needed to make a decision, it doesn't matter how great that insight is. When it's too late, it's too late.

The value that high-performance analytics adds, on top of the better insights you get from predictive and prescriptive analytics, is the ability to "compress time" and deliver those insights more quickly, thereby improving their value to the business decision process already in place.

The history of fast analytics

If you think about it, this is really why computers were first invented: to deliver insight faster than previously thought possible. Alan Turing, who conceived the Turing machine, considered a model for the general-purpose computer, went on to build machines that sped up the decoding of German military messages encrypted by Enigma machines in World War II. The use of mathematics to help recognize patterns in these messages (analytics), coupled with high-performance machine processing, means the earliest form of high-performance analytics helped bring about the defeat of the Axis powers in World War II. Another fascinating example of analytics helping the Allies win WWII can be found in Jordan Ellenberg's "How Not to Be Wrong: The Power of Mathematical Thinking." In it you can read how Abraham Wald, a member of the Statistical Research Group (SRG), a then-classified program tasked with helping the war effort, solved the problem of where best to add armor on airplanes to decrease the number of planes being shot down.

Talk about analytics having a meaningful impact!

Imagine how you might apply analytics to help your organization solve problems or improve efficiency. Bob Dudley, BP Group Chief Executive, provides a potential case in point in oil and gas with this statement from a speech he gave this March at the Mexican Energy Reform Summit 2015: "If the global energy environment was highly competitive before - at $100 a barrel - it just got ultra-competitive at $50 to $60 a barrel."

This is a perfect example of where analytics can have a big impact on oil & gas upstream processes. By helping to reduce the overall cost of getting oil out of the ground, and then improving the processes of getting it to market, companies can potentially improve profits in this down market. Read more about how analytics can help reduce costs in upstream exploration and production (E&P) in my previous post, "Analytics is an enhanced oil recovery process," and in Keith Holdaway's book, "Harness Oil and Gas Big Data with Analytics." Interested in how analytics can help in the downstream process? Then see my colleague Charlie Chase's recent post, "How to use analytics with O&G downstream data to improve forecast accuracy."

Post a Comment