New SAS MDM release brings harmony to big data discord

I've been in many bands over the years – from rock to jazz to orchestra – and each brings with it a different maturity, skill level, attitude, and challenge. Rock is arguably the easiest (and the most fun!) to play, as it involves the fewest members, the lowest skill level, a goodly amount of drama, and the least political atmosphere. Moving up to jazz and orchestra adds more members and takes more skill, maturity, and discipline. Unfortunately, it also adds politics and bureaucracy – and, as the pic shows, the tendency to make a strange face when you're rockin' out.

Organizations can also be seen in this way. It's great to be on an agile, rock-star team with less politics, but unfortunately, that's not always the case. Sometimes, you're playing first-chair trombone in the orchestra, the second-chair French horn is playing in a different key from the wrong songbook, and the strings don't want to rehearse on Saturdays.

What does that lead to? Cacophony. Discord. What's a conductor to do? Read More »

EMC and SAS redefine big data analytics with the data lake

Adoption of Hadoop, a low-cost open source platform used for processing and storing massive amounts of data, has exploded by almost 60 percent in the last two years alone, according to Gartner. One primary use case for Hadoop is as a data lake – a vast store of raw, minimally processed data.

But in many ways, because of the perceived lack of governance and security, the Hadoop world is still like the Wild West, where gunslingers and tumbleweeds abound. In fact, the same Gartner report identified the top challenges to big data adoption as:

  • Deriving business value.
  • Security and governance.
  • Data integration.
  • Skills.
  • Integrating with existing infrastructure.

Consequently, customers struggle with three main issues: 1) how to start their big data initiatives; 2) how to build out the infrastructure; and 3) how to run, manage, and scale out their solutions. Read More »

Stability and predictability: The alternative selling points for your data quality vision?

One thing that always puzzled me when starting out with data quality management was just how difficult it was to obtain management buy-in. I've written before on this blog about times I've witnessed considerable financial losses attributed to poor data quality being met with little more than a shrug of management's shoulders.

So how can you sell data quality to senior management?

The typical approach is to focus on benefits such as cost reduction, profit increase, faster projects, regulatory compliance, customer satisfaction and various operational efficiencies. But what if none of these are motivating your management team? What else can you try? What else motivates the average manager? Read More »

Finding the signal in the analytics noise

Are you often confused about what people mean when they talk about analytics?

Me too.

Let me state the obvious: analytics is a catchall term, and a trendy one at that. There's hardly a standard definition of it. Adding to the confusion, in many circles the term big data analytics has entered the business vernacular over the past five years. Are the big ones different from their smaller or regular counterparts? If so, how?

Read More »

Provisioning data for advanced analytics in Hadoop

The data lake is a great place to take a swim, but is the water clean? My colleague, Matthew Magne, compared big data to the Fire Swamp from The Princess Bride, and it can seem that foreboding.

The questions we need to ask are: How was the data transformed and cleansed prior to reaching Hadoop? If multiple data sources are being loaded, are the business keys or surrogate keys aligned across data sets? How will dirty data affect the analytics?
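As a rough illustration of the second question – checking that business keys line up across sources before they land in Hadoop – here's a minimal Python sketch using pandas. The file and column names are hypothetical, not from any specific environment:

```python
import pandas as pd

# Hypothetical extracts from two source systems feeding the lake.
crm = pd.read_csv("crm_customers.csv")         # key column: customer_id
billing = pd.read_csv("billing_accounts.csv")  # key column: customer_id

crm_keys = set(crm["customer_id"].dropna())
billing_keys = set(billing["customer_id"].dropna())

# Keys present in one source but not the other signal misalignment
# that will later surface as failed joins or orphaned records.
only_in_crm = crm_keys - billing_keys
only_in_billing = billing_keys - crm_keys

print(f"{len(only_in_crm)} CRM customers missing from billing")
print(f"{len(only_in_billing)} billing accounts missing from CRM")
```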

As an ETL/ELT developer, I once spent a great deal of time building and scheduling jobs to ensure a clean source of data for our users. Now, with big data, the sources are more varied, data enters the enterprise faster, and the volumes are bigger than ever.

Naturally, when I was on the ETL/ELT side of things, much of my time was divided between building data flows to support the data warehouse and building jobs to create analytic base tables. What if there were a better way to enable the business user? What if we could free up IT to focus on the care and feeding of the Hadoop environment? Read More »

Using Hadoop: Emerging options for improved query performance

In my last two posts, we reached two conclusions. First, because completing a JOIN query in Hadoop can require broadcasting data across the internal network, JOINs over files distributed using HDFS carry a potential for performance degradation. Second, some hand-engineered techniques (such as replicating tables or first breaking your queries into Semi-JOINs, sketched below) can alleviate some of those bottlenecks.
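To make the Semi-JOIN idea concrete, here is a minimal Python sketch – pure illustration with made-up tables, not tied to any particular Hadoop API – of how filtering the large side against a broadcast key set shrinks the data that must cross the network before the full join runs:

```python
# Semi-JOIN sketch: instead of shuffling every row of the large
# `orders` table, broadcast only the (small) set of customer keys,
# filter locally, and join just the survivors.

customers = {101: "Acme", 102: "Globex", 103: "Initech"}
orders = [
    (1, 101, 250.0),  # (order_id, customer_id, amount)
    (2, 999, 80.0),
    (3, 103, 40.0),
    (4, 998, 15.0),
]

# Step 1: broadcast the key set, which is far smaller than the table.
customer_keys = set(customers)

# Step 2: filter the large side locally against the broadcast keys.
matching = [o for o in orders if o[1] in customer_keys]

# Step 3: only the surviving rows participate in the expensive join.
joined = [(oid, customers[cid], amt) for oid, cid, amt in matching]
print(joined)  # [(1, 'Acme', 250.0), (3, 'Initech', 40.0)]
```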

Hadoop 1.0 tightly couples task and resource management with the MapReduce execution model, which constrains how many of these optimizations can be introduced automatically within the execution model. In YARN (Hadoop 2.0), however, the execution model, resource management, and task management are all decoupled, allowing developers to devise components that are much better at recognizing opportunities for optimization and at orchestrating tasks to achieve those optimizations. Read More »

Showing the ugly face of bad data: Part 2

In my previous post, I talked about how a bank realized that data quality was central to some very basic elements of its initiatives, such as know your customer (KYC) and customer onboarding. In this post, let's explore what this organization did to foster an environment of data quality throughout its customer data.

The Process

In the initial scope, we outlined three different types of profiles – identity, demographics and communication. We decided to educate the business users and IT on the data quality assessment approach so that they would feel comfortable repeating the process themselves. The success of data quality, master data management (MDM) and data governance initiatives largely depends on how quickly organizations become comfortable embracing these ideas and owning the responsibility of executing them.

As a result, we made sure everyone involved in the project understood the high-level functional capabilities and the ease of use of the SAS data management solution. Data management projects often suffer from a lack of buy-in because end users find these tools far too complex. Without making data management and quality processes business-friendly, the effort to keep data consistent for business purposes stalls after the first few steps. Read More »

The impact of data quality reach

One of the common traps I see data quality analysts falling into is measuring data quality in a uniform way across the entire data landscape.

For example, you may have a transactional dataset with hundreds of records that have missing values or badly entered formats. In contrast, you may have a master dataset, such as an equipment master or location master, with only a handful of poor-quality records.

When using standard data profiling metrics, it is easy to assume that the transactional dataset performs the worst as it contains the highest volume of errors. In fact, the master dataset could have a far greater negative impact on the organisation because of its "data quality reach." Read More »
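One way to picture this "data quality reach" is to weight each dataset's error rate by how many downstream processes consume it. Here is a minimal Python sketch; all figures are purely illustrative:

```python
# Hypothetical figures: the transactional set has far more bad
# records in absolute terms, but the master set feeds many more
# downstream processes, so its defects propagate much further.
datasets = {
    "transactions":     {"records": 1_000_000, "errors": 400, "consumers": 2},
    "equipment_master": {"records": 5_000,     "errors": 12,  "consumers": 35},
}

for name, d in datasets.items():
    error_rate = d["errors"] / d["records"]
    # Reach-weighted impact: every consuming process inherits the defects.
    impact = error_rate * d["consumers"]
    print(f"{name}: error rate {error_rate:.4%}, reach-weighted impact {impact:.4f}")
```

On these made-up numbers, the master dataset's reach-weighted impact (0.084) dwarfs the transactional dataset's (0.0008), even though it holds far fewer bad records.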

SAS Data Loader for Hadoop helps your data heroes navigate the fire swamp of big data

In The Princess Bride, one of my favorite movies, our hero Westley – in an attempt to save his love, Buttercup – has to navigate the Fire Swamp. There, Westley and Buttercup encounter fire spouts, quicksand and the dreaded rodents of unusual size (ROUSes). Each time, he has a response to the threat and is able to get them through to the other side.

Just like Westley and Buttercup, our data heroes – the business analysts and data scientists – must brave the Data Swamp. What's a Data Swamp? It's a junkyard of minimally processed, dirty data that our data lakes, once filled with so much promise, can easily become. Read More »

Using Hadoop: Query optimization

In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed to be compared with all the records in a second data set. The need to look at the cross-product of the two data sets led to a data latency nightmare, requiring each data chunk to be broadcast to all the computing nodes and flooding the network. Read More »
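A back-of-the-envelope model in Python of why that broadcast hurts; all figures are hypothetical, chosen only to show the scaling:

```python
# Rough network-cost comparison for joining two data sets in a
# distributed cluster. Sizes and node count are invented.
left_gb, right_gb = 500, 200   # sizes of the two data sets
nodes = 40                     # compute nodes in the cluster

# Naive cross-product approach: one data set is broadcast in full
# to every node so that all record pairs can be compared locally.
broadcast_traffic_gb = right_gb * nodes

# Hash-partitioned (shuffle) join: each record crosses the network
# roughly once, routed by join key to a single destination node.
shuffle_traffic_gb = left_gb + right_gb

print(f"broadcast: ~{broadcast_traffic_gb} GB on the wire")  # ~8000 GB
print(f"shuffle:   ~{shuffle_traffic_gb} GB on the wire")    # ~700 GB
```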
