Data. It's everywhere. It ends up in many places, replicated for availability, accessibility or infrastructure cost reasons. For reporting, that same data is often restructured (denormalized or aggregated) into additional reporting and analytic data repositories.
Over time, new sources for enriching that data become available, from traditional sources like new transaction systems to web sources like Twitter, LinkedIn or Facebook. Hadoop presents an attractive, low-cost target for storing and processing all of these sources…the data lake.
SAS supplies several processing paradigms to help integrate and transform all of that data. SAS Hadoop processing comes in three forms: FROM (access and extract data from Hadoop), WITH (access and process data in parallel and in memory), and IN (access and process the data inside the Hadoop cluster). Let’s look at each one from a data management perspective.
The SAS FROM story offers more than a simple data connection to the Hadoop cluster. The SAS/ACCESS engine for Hadoop comes in two flavors: one for accessing Hive and one for accessing Cloudera Impala. The SAS/ACCESS engine collects SAS metadata on the objects you build in Hadoop, making those objects available throughout your data flows. The intelligence behind these connectors makes performance decisions (like using a direct HDFS connection) seamlessly, without user intervention.
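As a minimal sketch (the server names, schema and user below are placeholders, not a real configuration), the two flavors look like this, with an ordinary PROC SQL step extracting data FROM the cluster:

```sas
/* SAS/ACCESS Interface to Hadoop: connect to Hive */
libname hdp hadoop server="hive.example.com" port=10000
        schema=sales user=sasdemo;

/* SAS/ACCESS Interface to Impala: same idea, Impala flavor */
libname imp impala server="impala.example.com" schema=sales;

/* Pull a filtered extract FROM Hadoop into a SAS data set */
proc sql;
   create table work.big_orders as
   select order_id, customer_id, amount
   from hdp.orders           /* hypothetical Hive table */
   where amount > 1000;
quit;
```

Once the librefs exist, the Hadoop tables behave like any other SAS library members in your data flows.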
The SAS WITH story provides transformation capabilities not yet available in Hadoop. UPDATE and DELETE are standard SQL operations used in a variety of data processing programs. Hive does not yet support them, but you can use PROC IMSTAT (part of the WITH story) to lift a table or partition into memory and perform these operations in parallel. The result can then be reincorporated into the Hive table, eliminating the need to truncate and reload from an RDBMS data source.
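Here is a hedged sketch of that pattern, assuming a running SAS LASR Analytic Server; the host, port, tag, and the table and column names are all hypothetical, and the exact statement options vary by release:

```sas
/* Point at a LASR server and lift the Hive table into memory  */
/* (a data step load runs through the client; PROC LASR can    */
/* load in parallel instead)                                   */
libname lasr sasiola host="lasr.example.com" port=10010 tag=sales;

data lasr.orders;
   set hdp.orders;       /* the Hive libref defined earlier */
run;

proc imstat;
   table lasr.orders;

   /* UPDATE: close out old orders, in memory and in parallel */
   where order_dt < '01JAN2015'd;
   update status='CLOSED';
   run;

   /* DELETE: remove bad rows Hive could not delete in place */
   where amount <= 0;
   deleterows / purge;
   run;
quit;
```

A SAVE statement can then persist the updated table back to HDFS so it can be reincorporated into the Hive table.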
Finally, the SAS IN story provides data quality and coding capabilities for data management. The SAS Data Quality Accelerator for Hadoop (part of SAS Data Loader) lets a user run eight different functions in parallel against data tables in Hadoop: casing, extraction, gender analysis, identification analysis, match code, parsing, pattern analysis and standardization. With the SAS Code Accelerator for Hadoop, a user can take advantage of the rich DS2 language to perform difficult transformations (like a pivot or transpose) on a table in parallel, as in the sketch below.
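This sketch shows the Code Accelerator shape with a simple transformation rather than a full pivot; the table and column names are hypothetical, and it assumes the SAS Embedded Process is installed on the cluster so the thread program runs in parallel on the data nodes:

```sas
proc ds2 ds2accel=any;              /* push the thread program into Hadoop */
   thread std_th;
      dcl varchar(100) city_std;
      method run();
         set hdp.customers;         /* Hive table via the earlier libref */
         city_std = upcase(city);   /* transformation runs in the cluster */
      end;
   endthread;

   data hdp.customers_std (overwrite=yes);
      dcl thread std_th t;
      method run();
         set from t;                /* land results in a new Hive table */
      end;
   enddata;
run;
quit;
```

When the Data Quality Accelerator is also licensed, the same thread program can call the parallelized data quality functions, keeping the entire cleanup inside the cluster.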
The next time you’re considering difficult transformations, or integrating Hadoop data objects and their metadata into your data flows, think snap, crackle, pop or better yet…FROM, WITH and IN!