Hadoop and big data management: How does it fit in the enterprise?

The other day, I was looking at an enterprise architecture diagram, and it actually showed a connection between the marketing database, the Hadoop server and the data warehouse. My reaction was twofold: first, I was amazed; second, I was very interested in how this customer uses Hadoop. Read More »

Who owns big data?

Data ownership has always been a thorny issue, but the era of big data is sprouting bigger thorns. Last century, data ownership was the equivalent of "you break it, you buy it." If you own data, you are responsible for it and can be held accountable if something goes wrong with it (e.g., data quality issues). This meant data ownership usually fell into the bottomless well of the business-versus-IT debate, which came down to arguing over whether data ownership is equivalent to business process ownership or database management.

But then the big rallying cry of this century became: Data is a corporate asset collectively owned by the entire enterprise. In data-driven enterprises, everyone, regardless of their primary role or job function, must accept a shared responsibility for preventing data quality lapses, and for responding appropriately to mitigate the associated business risks when issues do occur. All the while, individuals must still be held accountable for the business process, database management, data stewardship, and many other data-related tasks within the organization. Read More »

SAS MDM new release brings harmony to big data discord

I've been in many bands over the years, from rock to jazz to orchestra, and each brings with it a different maturity, skill level, attitude, and challenge. Rock is arguably the easiest (and the most fun!) to play, as it involves the fewest members, the lowest skill level, a goodly amount of drama, and the least political atmosphere. Moving up to jazz and orchestra adds more members and takes more skill, maturity, and discipline. Unfortunately, it also adds politics and bureaucracy – and, as the pic shows, the tendency to make a strange face when you're rockin' out.

Organizations can also be seen in this way. It's great to be on an agile, rock-star team with little politics, but unfortunately, that's not always the case. Sometimes you're playing first chair trombone in the orchestra, the second chair French horn is playing in a different key from the wrong songbook, and the strings don't want to rehearse on Saturdays.

What does that lead to? Cacophony. Discord. What's a conductor to do? Read More »

EMC and SAS redefine big data analytics with the data lake

Adoption of Hadoop, a low-cost open source platform for processing and storing massive amounts of data, has exploded by almost 60 percent in the last two years alone, according to Gartner. One primary use case for Hadoop is as a data lake – a vast store of raw, minimally processed data.

But, in many ways, because of the perceived lack of governance and security, the Hadoop world is still like the Wild West, where gunslingers and tumbleweeds abound. In fact, the same Gartner report identified the top challenges to big data adoption as:

  • Deriving business value.
  • Security and governance.
  • Data integration.
  • Skills.
  • Integrating with existing infrastructure.

Consequently, customers struggle with three main issues: 1) how to start their big data initiatives, 2) how to build out the infrastructure, and 3) how to run, manage and scale out their solutions. Read More »

Stability and predictability: The alternative selling points for your data quality vision?

One thing that always puzzled me when I was starting out in data quality management was just how difficult it was to obtain management buy-in. I've spoken before on this blog about the times I've witnessed considerable financial losses attributed to poor data quality met with little more than a shrug of management shoulders and no action.

So how can you sell data quality to senior management?

The typical approach is to focus on benefits such as cost reduction, profit increase, faster projects, regulatory compliance, customer satisfaction and various operational efficiencies. But what if none of these are motivating your management team? What else can you try? What else motivates the average manager? Read More »

Finding the signal in the analytics noise

Are you often confused about what people mean when they talk about analytics?

Me too.

Let me state the obvious: analytics is a catchall term, and a trendy one at that. There's hardly a standard definition of it. Adding to the confusion in many circles, the term big data analytics has entered the business vernacular over the past five years. Are the big ones different from their smaller or regular counterparts? If so, how?

Read More »

Provisioning data for advanced analytics in Hadoop

The data lake is a great place to take a swim, but is the water clean? My colleague, Matthew Magne, compared big data to the Fire Swamp from The Princess Bride, and it can seem that foreboding.

The questions we need to ask are: How was the data transformed and cleansed prior to reaching Hadoop? If multiple data sources are being loaded, are the business keys or surrogate keys aligned across data sets? How will dirty data affect the analytics?
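
The key-alignment question is worth making concrete. Here's a minimal Python sketch, assuming two hypothetical extracts (the file names and the cust_id key column are invented for the example), that measures how well a fact table's keys resolve against a reference table before any analytics runs on them:

```python
import pandas as pd

# Hypothetical extracts landed in the data lake; file names and the
# cust_id key column are invented for this example.
customers = pd.read_csv("crm_customers.csv")        # reference table
transactions = pd.read_csv("pos_transactions.csv")  # fact table

# What share of transaction keys resolve to a known customer?
known = transactions["cust_id"].isin(customers["cust_id"])
print(f"Key alignment: {known.mean():.1%} of transactions match a customer")

# Orphan keys suggest the sources were not conformed before loading.
orphans = transactions.loc[~known, "cust_id"].drop_duplicates()
print(f"{len(orphans)} orphan keys, e.g.: {orphans.head().tolist()}")
```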

As an ETL/ELT developer, I once spent a great deal of time building and scheduling jobs to ensure a clean source of data for our users. Now, with big data, the sources are more varied, the speed at which data enters the enterprise is faster and the volumes are bigger than ever.

Naturally, when I was on the ETL/ELT side of things, much of my time was divided between building data flows to support the data warehouse and building jobs to create analytic base tables. What if there were a better way to enable the business user? What if we could free up IT to focus on the care and feeding of the Hadoop environment? Read More »

Using Hadoop: Emerging options for improved query performance

In my last two posts, we concluded two things. First, because of the need for broadcasting data across the internal network to enable the complete execution of a JOIN query in Hadoop, there is a potential for performance degradation for JOINs on top of files distributed using HDFS. Second, there are some techniques that can be hand-engineered (such as replicating tables or breaking your queries into Semi-JOINs first) that can alleviate some of the performance bottlenecks.
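
To make the table-replication idea concrete, here's a minimal Python sketch of a map-side (replicated) join, with invented in-memory tables standing in for HDFS files: the small table is copied to every mapper and held in memory, so each record of the large table can be joined locally without broadcasting rows across the network.

```python
# Illustrative tables only; in practice the small (dimension) table is
# replicated to every node and the large (fact) table is streamed.
small_table = {
    1: "Electronics",
    2: "Clothing",
}

large_table = [
    {"sale_id": 100, "category_id": 1, "amount": 250.0},
    {"sale_id": 101, "category_id": 2, "amount": 40.0},
]

def map_side_join(records, lookup):
    """Join each streamed record against the in-memory lookup table,
    so no shuffle or broadcast is needed at join time."""
    for rec in records:
        category = lookup.get(rec["category_id"])
        if category is not None:  # inner-join semantics
            yield {**rec, "category": category}

for row in map_side_join(large_table, small_table):
    print(row)
```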

In Hadoop 1.0, the tight coupling of task management, resource management and the MapReduce execution model makes it difficult to introduce some of these optimizations automatically within the execution model. In YARN (Hadoop 2.0), however, the execution model, resource management and task management are all decoupled, which allows developers to devise components that are much better at both recognizing opportunities for optimization and orchestrating the tasks to achieve those optimizations. Read More »

Showing the ugly face of bad data: Part 2

In my previous post, I talked about how a bank realized that data quality was central to some very basic elements of its initiatives, such as know your customer (KYC) and customer onboarding. In this post, let's explore what this organization did to foster an environment of data quality throughout its customer data.

The Process

In the initial scope, we outlined three different types of profiles – identity, demographics and communication. We decided to educate the business users and IT on the data quality assessment approach so that they would feel comfortable repeating the process themselves. The success of data quality, master data management (MDM) and data governance initiatives largely depends on how quickly organizations become comfortable embracing these ideas and owning the responsibility of executing them.
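
As a hedged illustration of what such a repeatable assessment might look like (the column names and rules below are invented, not the bank's actual profiles), a simple completeness-and-validity pass over a communication profile could be sketched in Python:

```python
import pandas as pd

# Hypothetical customer extract; columns are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "not-an-email", "d@y.com"],
    "phone": ["555-0100", "555-0101", None, None],
})

# Completeness: share of non-null values per column.
print(df.notna().mean())

# Validity (communication profile): is the email well formed?
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print(f"Well-formed emails: {valid_email.mean():.0%}")
```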

As a result, we made sure everyone involved in the project understood the high-level functional capabilities and ease of use of the SAS data management solution. Data management projects often suffer from a lack of buy-in because end users find the tools far too complex. Without making data management and quality processes business-friendly, the effort to keep data consistent for business purposes stalls after the first few steps. Read More »

The impact of data quality reach

One of the common traps I see data quality analysts falling into is measuring data quality in a uniform way across the entire data landscape.

For example, you may have a transactional dataset with hundreds of records containing missing values or badly formatted entries. In contrast, you may have a master dataset, such as an equipment master or location master, with only a handful of poor quality records.

When using standard data profiling metrics, it is easy to assume that the transactional dataset performs the worst as it contains the highest volume of errors. In fact, the master dataset could have a far greater negative impact on the organisation because of its "data quality reach." Read More »
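
One way to make "reach" measurable is to weight each dataset's defect count by how many downstream processes consume it. The figures below are invented purely to illustrate the idea in Python:

```python
# Invented figures: a noisy transactional set feeding one report versus a
# mostly clean master set feeding dozens of downstream processes.
datasets = [
    {"name": "transactions", "errors": 500, "consumers": 1},
    {"name": "equipment_master", "errors": 15, "consumers": 40},
]

for ds in datasets:
    # Reach-weighted impact: raw defects multiplied by downstream reach.
    ds["impact"] = ds["errors"] * ds["consumers"]

for ds in sorted(datasets, key=lambda d: d["impact"], reverse=True):
    print(f"{ds['name']}: errors={ds['errors']}, impact={ds['impact']}")
```

Under these assumed numbers, the master dataset tops the list despite having a tiny fraction of the raw errors, which is exactly the trap uniform profiling metrics lead analysts into.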
