There she blows! – there she blows! A hump like a snow-hill! It is Hadoop!
Bringing your data and processing to Hadoop can sometimes feel like an insurmountable task – but it doesn’t have to be that way. The same technologies and capabilities that have powered SAS Data Management for over a decade can make wielding the power of Apache Hadoop more like a pleasure cruise and less like hunting a great beast. From my experiences in working with SAS and Hadoop, I'll describe four ways SAS can make Hadoop easier.
Access to Hadoop can be challenging for a variety of reasons (location, security, data format, data transport and user skill set). SAS foundation tools (in particular, SAS/ACCESS® interface technologies) let users access data in a number of ways. These technologies are developed in partnership with Hadoop vendors to allow deep integration with a data system. SAS can increase efficiency by making native connections for data transfers to the Hadoop Distributed File System (HDFS) and by allowing direct access to HDFS data. This implementation enables users to access their data in Hadoop from a desktop or from a remote server’s web user interface. Security can be applied on the server, on the client, or both – depending on IT security requirements. In turn, IT has the flexibility to keep data available while preventing sensitive data from being mixed with other data.
In the early days of Hadoop, there were limited options for formatting data because Hadoop offered few data types. Our customers overcame that limitation by using SAS formatting to convert native Hadoop data types (string, for example) to other data types that suited their processing. As Hadoop matured to support new data types, SAS complemented HDFS by providing formats better suited for analytics (high-performance data format) and analytic data preparation (SAS Scalable Performance Data Engine). These formats keep storage, required data types and metadata coherent, which increases the efficiency of data processing tasks.
Houston, we have a problem… we’re just not sure where. Often, finding the problem can be just as challenging as fixing it. In the world of database management systems, we have a mature data dictionary for storing descriptive statistics of the data: table storage volume, table column data types, average column value, standard deviation, null count or null percent per column, and median value per column, just to name a few. This metadata does not exist in a single place or a unified form in Hadoop.
SAS metadata is a big strength of our data management offerings. SAS can gather metadata across many systems to make activities like data migration, data processing and lineage tracking easier. Data profiling gives users the capability to pull the metadata in Hadoop to assess the quality of their data. Are there any patterns in the data? How incomplete is the data? Are there any trends in the quality of the data? Does the data contain personally identifiable information (PII)? These are questions that can be answered using SAS metadata and profiling tools.
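To make the idea of a column profile concrete, here is a minimal sketch in plain Python of the statistics named above (null count, null percent, average, median and standard deviation per column). It is not SAS code and does not reflect how the SAS profiling tools are implemented. The function names `profile_column` and `profile_table` and the sample rows are hypothetical, and the sketch assumes every row carries the same column names.

```python
import statistics


def profile_column(values):
    """Profile one column; None marks a missing value (an assumption of this sketch)."""
    total = len(values)
    present = [v for v in values if v is not None]
    null_count = total - len(present)
    profile = {
        "row_count": total,
        "null_count": null_count,
        "null_percent": 100.0 * null_count / total if total else 0.0,
    }
    # Numeric statistics only make sense when every non-null value is a number.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        profile["mean"] = statistics.fmean(present)
        profile["median"] = statistics.median(present)
        # Sample standard deviation needs at least two values.
        profile["std_dev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return profile


def profile_table(rows):
    """Pivot a list of row dicts into columns and profile each column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(values) for name, values in columns.items()}


# Hypothetical sample data, made up for illustration.
rows = [
    {"customer": "Ahab", "balance": 10.0},
    {"customer": None, "balance": 20.0},
    {"customer": "Ishmael", "balance": None},
    {"customer": "Starbuck", "balance": 30.0},
]

report = profile_table(rows)
for column, stats in report.items():
    print(column, stats)
```

Even a small report like this answers the completeness question directly: a column whose null percent climbs from one load to the next is a quality trend worth investigating.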
Hadoop is built for challenges involving large volumes of data – it's not well suited for small data problems. To support all the data required by the business, it's key to use a tool set that lives in both the big data and the tiny data arenas. SAS Data Management provides both extract, transform, load (ETL) and extract, load, transform (ELT) processing capabilities. This allows SAS to transform and blend data outside of Hadoop from files, database management systems (DBMS), streaming data or master data systems, just to name a few. Who is doing the work is just as important as where the work is taking place. Whether the work is being done by the enterprise, the business unit or a collaboration between the two, SAS technologies enable seamless collaboration on data in Hadoop.
Once users understand their data quality issues, they can work to correct them. That’s the easy part, right? Hadoop has been around for just over a decade, and its SQL support still lags behind the function sets of most DBMSs. Then there's the issue of the coding languages that are available. An Oracle DBMS developer may run into issues working with data sets migrated from Oracle into Hadoop Hive, where the available tools are HiveQL, MapReduce and Apache Spark. Further, there are no native data quality procedures in Hadoop today.
The SAS Quality Knowledge Base provides a rich set of files that store definitions for performing various data cleansing tasks. Performing standardization, semantic parsing, clustering or field extraction is as easy as making function calls through the SAS programming language.
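As a rough illustration of what "standardization as a function call" means, the sketch below applies a lookup table of definitions to a raw value. This is a plain-Python stand-in, not the SAS Quality Knowledge Base or its API. The `STREET_SUFFIXES` table and the `standardize_address` function are hypothetical names invented for this example, and a real knowledge base holds far richer, locale-aware definitions.

```python
# Hypothetical standardization definitions: variant spelling -> preferred form.
STREET_SUFFIXES = {
    "st": "Street",
    "ave": "Avenue",
    "rd": "Road",
    "blvd": "Boulevard",
}


def standardize_address(raw):
    """Collapse whitespace, fix casing and expand known street suffixes."""
    cleaned = []
    for token in raw.split():  # split() with no arguments also trims and collapses spaces
        key = token.lower().strip(".,")
        # Use the preferred form when the token is a known variant,
        # otherwise just normalize its casing.
        cleaned.append(STREET_SUFFIXES.get(key, token.title()))
    return " ".join(cleaned)


print(standardize_address("  123  MAIN st. "))
print(standardize_address("45 elm AVE"))
```

The point of the design is that the cleansing rules live in data (the definitions table) rather than in the calling code, so the same one-line call works as the definitions grow.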
Overcoming the challenges
Many customers encounter these four bottlenecks when they first start working with a Hadoop migration. With SAS technologies, we can make Hadoop approachable and accessible for a variety of enterprise and business needs. And you won't need to learn complex skills to succeed.
Download a free paper – Bringing the Power of SAS to Hadoop