Demand for analytics is at an all-time high. Monster.com has rated SAS as the number one skill to have to increase your salary, and Harvard Business Review continues to highlight why the data scientist is the sexiest job of the 21st century. It is clear that if you want to be sexy and rich, you are in the right profession! Jokes aside, I have spent the past five weeks travelling around Australia, Singapore and New Zealand discussing the need to modernise analytical platforms to help meet the sharp increase in demand for analytics to support better business and social outcomes.
While there are many aspects to modernisation, the most frequent topic of discussion during the roadshow was Hadoop. About 20% of the 150-plus companies were already up and running with their Hadoop playpen. Questions had moved beyond “What is Hadoop?” to “How do I leverage Hadoop as part of my analytical process?”. Within the region we have live customers using Hadoop in various ways:
- Exploring new text-based data sets like customer surveys and feedback.
- Replicating core transaction system data to perform ad hoc queries faster, which removes the need to grab extra data not currently supported in the EDW.
- Establishing an analytical sandpit to explore relationships that can have an impact on marketing, risk, fraud and operations, by looking at new data sets and combining them with traditional data sets.
There was unanimous agreement on the key challenge. While Hadoop provided a low-cost way to store and retrieve data, it was still a cost without an obvious business outcome. Customers were looking at how to plug Hadoop into their existing analytical processes, and quickly discovering that Hadoop comes with a complex zoo of capabilities and, consequently, skills gaps.
Be assured that this was, and is, a top priority in our research and development labs. In response to our customers' concerns, our focus has been to reduce the skills needed to integrate Hadoop into the decision-making value chain. SAS offers a set of technologies that enable users to bring the full power of business analytics functionality to Hadoop. Users can prepare and explore data, develop analytical models with the full depth and breadth of techniques, and execute the analytical model in Hadoop. This is best explained using the four key areas of the data-to-decision lifecycle:
- Managing data – there are a couple of gaps to address in this area. Firstly, you may need to connect to Hadoop, read and write file data, or execute a MapReduce job. In Base SAS you can use the FILENAME statement to read and write file data to and from Hadoop, and this can be done from your existing SAS environment. Using PROC HADOOP, users can submit HDFS commands and Pig scripts, as well as upload and execute MapReduce tasks.
SAS 9.4 is able to use Hadoop to store SAS data through the SAS Scalable Performance Data (SPD) Engine within Base SAS. With SAS/ACCESS to Hadoop, you can connect, read and write data to and from Hadoop as if it were any other source that SAS can connect to. From any SAS client, a connection to Hadoop can be made and users can analyse data with their favourite SAS procedures and the DATA step. SAS/ACCESS to Hadoop supports explicit HiveQL calls, and SAS can also translate supported procedures into the appropriate HiveQL. Rather than extracting the data into SAS for processing, the query is resolved on Hadoop and only the results are returned to SAS. SAS/ACCESS to Hadoop allows the SAS user to leverage Hadoop just as they do with an RDBMS today.
- Exploring and visualising insight – with SAS Visual Analytics, users can quickly and easily explore and visualise large amounts of data stored in the Hadoop Distributed File System (HDFS), using the SAS LASR Analytic Server. This is an extremely scalable, in-memory processing engine that is optimised for interactive and iterative analytics. The engine addresses the gaps in MapReduce-based analysis by persisting data in memory and taking full advantage of computing resources. Multiple users can interact with data in real time because data does not have to be reloaded into memory for each analysis or request, there is no serial sequence of jobs, and the available computational resources can be fully exploited.
- Building models – SAS High Performance Analytics (HPA) products (Statistics, Data Mining, Text Mining, Econometrics, Forecasting and Optimisation) provide a highly scalable in-memory infrastructure that supports Hadoop. By enabling you to apply domain-specific analytics to large data on Hadoop, it effectively eliminates the data movement between the SAS server and Hadoop. SAS provides a set of procedures that enable users to manipulate, transform, explore, model and score data, all within Hadoop. In addition, SAS In-Memory Statistics for Hadoop is an interactive programming environment for data preparation, exploration, modelling and deployment in Hadoop. It offers an extremely fast, multi-user environment in which you can use SAS Enterprise Guide to connect to and interact with LASR, or take advantage of SAS Studio, SAS's new web-based editor.
- Deploying and executing models – conventional model scoring requires the transfer of data from one system to SAS, where it is scored and then written back. In Hadoop, the movement of data from the cluster to SAS can be prohibitively expensive. Instead, you want to keep data in place and integrate SAS scoring processes on Hadoop. The SAS Scoring Accelerator for Hadoop enables analytic models created with Enterprise Miner or with core SAS/STAT procedures to be processed in Hadoop via MapReduce. This requires no data movement and is performed on the cluster in parallel, just as SAS does with other in-database accelerators.
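To make the idea of scoring data in place more concrete, here is a minimal Python sketch of the principle. It is not SAS code and does not use Hadoop or the SAS Scoring Accelerator. The model coefficients, field names and toy partitions are all hypothetical, and a thread pool stands in for the cluster. Each partition is scored where it sits (the "map" step) and only a small summary travels back to be combined (the "reduce" step).

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical logistic-regression coefficients, purely for illustration.
# In the setting described above, the model would be built elsewhere
# (for example in Enterprise Miner) and then published for scoring.
INTERCEPT = -3.0
COEFFICIENTS = {"late_payments": 0.8, "utilisation": 2.0}


def score(record):
    """Return the predicted probability for one record (a plain dict)."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * record[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(partition, threshold=0.5):
    """'Map' step: score every record in one partition, return a small summary."""
    flagged = sum(1 for record in partition if score(record) >= threshold)
    return {"records": len(partition), "flagged": flagged}


def score_in_place(partitions):
    """Score all partitions in parallel, then combine the summaries ('reduce')."""
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(score_partition, partitions))
    return {
        "records": sum(s["records"] for s in summaries),
        "flagged": sum(s["flagged"] for s in summaries),
    }


# Two toy partitions standing in for data blocks held on different cluster nodes.
partitions = [
    [{"late_payments": 0, "utilisation": 0.2}, {"late_payments": 4, "utilisation": 0.9}],
    [{"late_payments": 1, "utilisation": 0.5}, {"late_payments": 3, "utilisation": 0.8}],
]

summary = score_in_place(partitions)
print(summary)  # prints {'records': 4, 'flagged': 2}
```

The point of the sketch is that the raw records never leave their partition: only the per-partition summaries are moved and combined, which is why the cost of scoring stays low as the data grows.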
To stay ahead of competitors, we need to act now to leverage the power of Hadoop. SAS has embraced Hadoop and provides a flexible architecture to support deployment with other data warehouse technologies. SAS now enables you to analyse large, diverse and complex data sets in Hadoop within a single environment – instead of using a mix of languages and products from different vendors.