Self-service data preparation transforms data professionals into data rock stars

When my band first started out and needed a sound system, we bought a pair of cheap yet indestructible Peavey speakers, some Radio Shack microphones and a powered mixer. The result? We sounded awful and often split our eardrums with high-pitched feedback and raw, untrained vocals. It took us years of practicing and playing out before we stepped up to the right gear – a Mackie 808S powered mixer, a pair of 15" Community speakers and a Shure Beta 58 microphone. The new setup was lightweight, enriched our vocals and suppressed feedback. It made playing out a pleasure, and fans actually started coming to gigs. Whether you're singing karaoke, ripping a guitar solo live in front of hundreds of fans or cleaning up your customer data on Hadoop, having the right gear is essential.

Data professionals, including business analysts and data scientists, face similar struggles. IT lacks the agility to respond quickly to their data requests. The data is often raw, noisy and needs to be enriched. And these data professionals lack the skills to cleanse, manage and transform that data on Hadoop. As a result, more time is spent preparing data than generating insight.

Analysts have identified a trend toward self-service data preparation tools that give data professionals the data they need to make decisions. SAS Data Loader for Hadoop fills this need, tackles the Hadoop skills shortage, and empowers business users and data scientists alike to prepare, integrate and cleanse big data faster and more easily – without writing code.

The latest release of SAS Data Loader helps organizations:

  • Speed data management processes.
  • Improve data professionals' productivity.
  • Manage data where it lives.

Speed data management processes with Spark

In addition to reading and writing Apache Spark RDDs (resilient distributed datasets) on Hadoop, SAS Data Loader can now run all data quality functions in memory using Spark for improved performance. One example is the Cluster and Survive data directive, which draws on SAS' years of expertise in bringing identity resolution and master data management (MDM) solutions to market. This MDM for big data capability is delivered as a simple wizard-driven directive that groups similar records and then collapses each group into one "best record" based on a set of business rules.
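To make the cluster-then-survive idea concrete, here is a minimal plain-Python sketch. It is not the SAS directive itself and does not use Spark. The match key (normalized name plus ZIP code), the survivorship rule (newest non-empty value wins) and all field names are assumptions chosen purely for illustration.

```python
from collections import defaultdict

def match_key(record):
    # Assumed matching rule: lowercase the name, drop anything that is not
    # a letter or digit, and pair the result with the ZIP code.
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return (name, record["zip"])

def cluster(records):
    # Group records that share the same match key.
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return list(groups.values())

def survive(group):
    # Assumed business rule: for each field, keep the first non-empty value,
    # looking at the most recently updated record first.
    ordered = sorted(group, key=lambda r: r["updated"], reverse=True)
    return {
        field: next((r[field] for r in ordered if r[field]), "")
        for field in ordered[0]
    }

def cluster_and_survive(records):
    return [survive(group) for group in cluster(records)]

# Hypothetical customer records; the first two describe the same person.
records = [
    {"name": "Ann O'Neil", "zip": "27513", "email": "", "updated": "2015-01-10"},
    {"name": "ann oneil", "zip": "27513", "email": "ann@example.com", "updated": "2014-06-01"},
    {"name": "Bob Smith", "zip": "10001", "email": "bob@example.com", "updated": "2015-02-01"},
]

best_records = cluster_and_survive(records)
for record in best_records:
    print(record)
```

The two Ann records collapse into a single best record that keeps the newer spelling of her name and borrows the email address from the older record, because the newer one left that field empty.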

Improve data professionals' productivity with Impala and chaining directives

In addition to running faster in memory, queries can now be designed using Impala functions, which run SQL-like commands on Hadoop up to 60 times faster. You can also chain or group multiple directives together and run them in parallel or sequentially on Hadoop. For example, you may want to copy an Orders table and a Customers table from an Oracle database or a SAS data set into Hadoop; clean up state codes on the customer data; merge the data together; and then lift that data into SAS LASR Analytic Server for visualization or analysis using SAS Visual Analytics. This entire set of five or six jobs can be executed together – and, with the new public REST API, can be run by an external job scheduler overnight to further automate the process and save your analysts and data scientists time.
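The shape of such a chain can be sketched in a few lines of Python. This is only an analogy under stated assumptions: every function below is a hypothetical stand-in for a directive, in-memory lists stand in for the Oracle and Hadoop tables, and the final load into SAS LASR Analytic Server is left out. It does not call SAS Data Loader or its REST API.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed lookup table for the state-code cleanup step.
STATE_LOOKUP = {"north carolina": "NC", "nc": "NC", "new york": "NY", "ny": "NY"}

def copy_table(name, source):
    # Stand-in for a copy directive: returns a fresh copy of the named table.
    return [dict(row) for row in source[name]]

def clean_state_codes(customers):
    # Stand-in for a data quality directive on the customer data.
    return [
        {**c, "state": STATE_LOOKUP.get(c["state"].strip().lower(), c["state"])}
        for c in customers
    ]

def merge_tables(orders, customers):
    # Stand-in for a merge directive: join orders to customers on customer_id.
    by_id = {c["customer_id"]: c for c in customers}
    return [{**o, **by_id[o["customer_id"]]} for o in orders if o["customer_id"] in by_id]

def run_chain(source):
    # The two copies are independent, so they run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        orders_future = pool.submit(copy_table, "orders", source)
        customers_future = pool.submit(copy_table, "customers", source)
        orders = orders_future.result()
        customers = customers_future.result()
    # The remaining steps depend on each other, so they run sequentially.
    customers = clean_state_codes(customers)
    return merge_tables(orders, customers)

# Hypothetical source tables.
source = {
    "orders": [
        {"order_id": 1, "customer_id": 10, "amount": 25.0},
        {"order_id": 2, "customer_id": 11, "amount": 40.0},
    ],
    "customers": [
        {"customer_id": 10, "state": "North Carolina"},
        {"customer_id": 11, "state": " ny "},
    ],
}

merged = run_chain(source)
for row in merged:
    print(row)
```

The design point is the ordering: independent steps are grouped to run side by side, while steps that consume another step's output wait their turn.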

Manage data where it lives

SAS lets us run the same functions and use the same skill sets and technologies across multiple environments via a portable execution engine called SAS Embedded Process. Think of this as the secret sauce that allows us to run the same code and data quality functions, whether that involves guessing gender, standardizing a state code or merging two tables, and whether the work runs in-memory, in-stream, in-Hadoop or in-database.
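As a loose analogy for "write the function once, run it where the data lives," the sketch below applies one Python function both to an in-memory list and inside a SQLite database, using the standard library's `sqlite3` user-defined function support. This is not how SAS Embedded Process is implemented; SQLite, the lookup table and the sample values are all assumptions made for illustration.

```python
import sqlite3

# Assumed lookup for spelled-out state names.
STATE_CODES = {"north carolina": "NC", "new york": "NY"}

def standardize_state(value):
    # One definition of the rule, reused in both environments below.
    if value is None:
        return None
    cleaned = value.strip()
    return STATE_CODES.get(cleaned.lower(), cleaned.upper())

raw_states = ["North Carolina", " ny ", "New York"]

# Environment 1: apply the function to values already in memory.
in_memory = [standardize_state(v) for v in raw_states]

# Environment 2: register the same function with the database engine so the
# rule runs next to the stored rows instead of pulling them out first.
conn = sqlite3.connect(":memory:")
conn.create_function("standardize_state", 1, standardize_state)
conn.execute("CREATE TABLE customers (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    list(enumerate(raw_states, start=1)),
)
in_database = [
    row[0]
    for row in conn.execute("SELECT standardize_state(state) FROM customers ORDER BY id")
]
conn.close()

print(in_memory)
print(in_database)
```

Both lists come out identical, which is the point: the rule is defined once and the execution location changes around it.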

Here's an example: A new merge function takes advantage of the parallel processing enabled by SAS Embedded Process. The merge directive can execute additional logic beyond a standard SQL join and pushes down processing of almost 300 functions to Hadoop for improved performance. We've expanded distribution support to Pivotal HD and IBM BigInsights in addition to Cloudera, Hortonworks and MapR. Finally, we're releasing a free trial that runs on production Hortonworks and Cloudera Hadoop clusters and converts to a production license without the need to reinstall.

The next time you hear an off-key karaoke singer – or work with a data analyst who takes too long to deliver an accurate report – don't hold it against them. They probably just need better gear.

Watch a demo to learn more about SAS Data Loader for Hadoop.


About Author

Matthew Magne

Principal Product Marketing Manager

@bigdatamagnet - Matthew is a TEDx speaker, musician, and Catan player. He is currently the Global Product Marketing Manager for SAS Data Management, focusing on Big Data, Master Data Management, Data Quality, Data Integration and Data Governance. Previously, Matthew was an Information Management Solutions Architect at SAS and worked as a Certified Data Management Consulting IT Professional at IBM. He is also a recovering software engineer and entrepreneur. Mr. Magne received his BS, cum laude, in Computer Engineering from Boston University, has done graduate work in Object Oriented Development and completed his MBA at UNCW.
