Business needs and performance expectations: Data management for analytics

In the last few days, I have heard the term “data lake” bandied about in various client conversations. As with all buzz-term simplifications, the concept of a “data lake” seems appealing, particularly when it is implied to mean “a framework enabling general data accessibility for enterprise information assets.” And of course, as with all simplifications, the data lake comes bundled with presumptions, mostly centered on the deployment platform (that is, Hadoop and HDFS), that somewhat ignore two key questions: who is the intended audience for data accessibility, and what do they want to accomplish by having access to that data?

I’d suggest that the usage scenario is obviously centered on reporting and analytics; the processes that access the data used for transaction processing or operational systems would already be engineered to the application owners’ specifications. Rather, the data lake concept is about empowering analysts to bypass IT when there are no compelling reasons to endure the bottlenecks and bureaucracy that have become the IT division’s typical modus operandi.

Hadoop is not Beetlejuice

In the 1988 film Beetlejuice, the title character, hilariously portrayed by Michael Keaton, is a bio-exorcist (a ghost capable of scaring the living) hired by a recently deceased couple in an attempt to scare off the new owners of their house. Beetlejuice is summoned by saying his name three times. (Beetlejuice. Beetlejuice. Beetlejuice.)

Nowadays it seems like every time big data is discussed, Hadoop comes up in the conversation eventually. Need a way to deal with all that unstructured data? Hadoop. Need a way to integrate big data into your other applications? Hadoop. Need a replacement for your clunky old data warehouse? Hadoop.

Hadoop. Hadoop. Hadoop. It’s like saying it three times summons a data exorcist, a supernatural solution to all your big data problems. Unfortunately, neither big data nor Hadoop is that simple. I decided, therefore, to conduct a seance with a few of my fellow Data Roundtablers and exorcise three common misconceptions about Hadoop.

Differentiating process, persistence, and publication: Data management for analytics

As part of two of our client engagements, we have been tasked with providing guidance on an analytics environment platform strategy. More concretely, the goal is to assess the systems that currently compose the “data warehouse environment” and to identify the considerations for determining the optimal platforms to support the objectives of each reporting/analytics system.

Interestingly, yet not surprisingly, the systems are designed to execute data warehousing tasks of varying complexity and to support a community of users whose skills range from loading extracts into Excel to make simple bar charts to performing sophisticated data discovery and predictive/prescriptive analytics. Yet the siloed approach to management, design, and development of these systems has led to some unintentional replication.

The challenge is to determine what that replication really is. In some cases there is obvious data replication: a data subset is extracted from a set of files and is then loaded into an auxiliary data mart. In some cases the replication is process replication: an input data source is scanned, parsed, and transformed for loading into one data environment, and later the same data is pulled from that environment, scanned, parsed, and transformed again prior to loading into a data system that lies further downstream.
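
To make the distinction concrete, here is a minimal Python sketch (the file contents, field names, and loaders are hypothetical, not drawn from either client engagement). Removing process replication means the scan/parse/transform step runs once, and both data environments consume its output rather than each repeating the work against the raw source:

```python
import csv
from io import StringIO

# Hypothetical source extract; in practice this would be a set of flat files.
RAW_EXTRACT = "id,amount,region\n1, 100 ,East\n2, 250 ,West\n3, 75 ,East\n"

def parse_and_transform(raw_text):
    """Scan, parse, and transform the source exactly once."""
    rows = []
    for rec in csv.DictReader(StringIO(raw_text)):
        rows.append({
            "id": int(rec["id"]),
            "amount": float(rec["amount"].strip()),   # trim and cast the measure
            "region": rec["region"].strip().lower(),  # standardize the code
        })
    return rows

def load_warehouse(rows):
    """Stand-in for loading the primary data environment."""
    print("warehouse <-", rows)

def load_downstream_mart(rows):
    """Stand-in for the downstream mart that previously re-parsed the same data."""
    print("mart      <-", [r for r in rows if r["region"] == "east"])

# One scan/parse/transform feeds both targets.
clean = parse_and_transform(RAW_EXTRACT)
load_warehouse(clean)
load_downstream_mart(clean)
```

The replicated alternative would repeat the parse-and-transform logic inside each loader, which is exactly the pattern an assessment like this is meant to surface.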

Lineage, data quality and continuity: Keeping your data analytics healthy

The adoption of data analytics in organisations is widespread these days. Due to lower costs of ownership and increased ease of deployment, there are realistically no barriers for any organisation wishing to extract more value from their data.

This of course presents a challenge, because the rate of data analytics adoption has not always been matched by the adoption of sound data management practices for ensuring quality along the analytics information chain.

In the past, many organisations adopted the data warehouse model for enabling analytics, meaning that there was at least some kind of buffer between operational systems and the management reporting layer.

Many organisations now choose to ignore the data warehouse approach and suck data straight from live operational systems in order to perform regular analytics or perhaps ad hoc analysis of specific issues.

This is where issues can creep in, because the temptation to analyse poor-quality data is ever present.
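
As a minimal illustration of the buffer such checks can provide, here is a Python sketch (the field names and thresholds are hypothetical) that quarantines suspect records pulled from an operational system before they ever reach the analytics layer:

```python
# Hypothetical records pulled straight from a live operational system.
records = [
    {"customer_id": "C001", "age": 34, "revenue": 120.5},
    {"customer_id": "", "age": -3, "revenue": 99.0},      # fails two checks
    {"customer_id": "C003", "age": 51, "revenue": None},  # missing measure
]

def violations(rec):
    """Return the list of quality rules this record breaks."""
    issues = []
    if not rec["customer_id"]:
        issues.append("missing customer_id")
    if not 0 <= rec["age"] <= 120:
        issues.append("age out of range")
    if rec["revenue"] is None:
        issues.append("missing revenue")
    return issues

clean, quarantined = [], []
for rec in records:
    (quarantined if violations(rec) else clean).append(rec)

# Only `clean` flows on to analysis; `quarantined` goes back
# to the source owners for remediation.
print(f"{len(clean)} clean, {len(quarantined)} quarantined")
```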

The need for speed: A look at real-time analytics

You may feel like the world is moving faster than ever. If so, then you can take solace in two facts:

  • You're not alone in feeling this way.
  • You're right. It is.

Celebrating the 25th anniversary of the Web, The Economist ran a piece examining the increasingly rapid adoption of new technologies. From it:

It took only seven years from the first web pages in 1991 for the web to be used by a quarter of the American population. That compares with 46 years for electricity, 35 years for the phone and 26 years for television. The web, just 25 years old, is still at the start of its life.

Hadoop and big data management: How does it fit in the enterprise?

The other day, I was looking at an enterprise architecture diagram, and it actually showed a connection between the marketing database, the Hadoop server, and the data warehouse. My response can be summed up in two ways. First, I was amazed! Second, I was very interested in how this customer uses Hadoop.

Who owns big data?

Data ownership has always been a thorny issue, but the era of big data is sprouting bigger thorns. Last century, ownership was the data equivalent of “you break it, you buy it.” If you own data, you are responsible for it, and can be held accountable if something goes wrong with it (e.g., data quality issues). This meant data ownership usually fell into the bottomless well of the business versus IT debate, which came down to arguing over whether data ownership is equivalent to business process ownership or database management.

But then the big rallying cry of this century became: Data is a corporate asset collectively owned by the entire enterprise. In data-driven enterprises, everyone, regardless of their primary role or job function, must accept a shared responsibility for preventing data quality lapses, and for responding appropriately to mitigate the associated business risks when issues do occur. All the while, individuals must still be held accountable for the business process, database management, data stewardship, and many other data-related tasks within the organization.

SAS MDM new release brings harmony to big data discord

I've been in many bands over the years, from rock to jazz to orchestra, and each brings with it a different maturity, skill level, attitude, and challenge. Rock is arguably the easiest (and the most fun!) to play, as it involves the fewest members, the lowest skill level, a goodly amount of drama, and the least political atmosphere. Moving up to jazz and orchestra adds more members and takes more skill, maturity, and discipline. Unfortunately, it also adds politics and bureaucracy and, as the pic shows, the tendency to make a strange face when you're rockin' out.

Organizations can also be seen in this way. It's great to be on an agile, rock-star team with less politics, but unfortunately, that's not always the case. Sometimes you're playing first-chair trombone in the orchestra, the second-chair French horn is playing in a different key from the wrong songbook, and the strings don't want to rehearse on Saturdays.

What does that lead to? Cacophony. Discord. What's a conductor to do?

EMC and SAS redefine big data analytics with the data lake

Adoption of Hadoop, a low-cost open source platform used for processing and storing massive amounts of data, has grown by almost 60 percent in the last two years alone, according to Gartner. One primary use case for Hadoop is as a data lake – a vast store of raw, minimally processed data.
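
As a rough sketch of what "raw, minimally processed" means in practice, the following Python snippet lands source files in a date-partitioned lake layout without touching their contents (the paths and source names are hypothetical, and the local filesystem stands in for HDFS):

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical lake root; a local-filesystem stand-in for an
# HDFS location such as hdfs:///data/lake.
LAKE_ROOT = Path("/tmp/lake")

def land_raw_file(source_file: Path, source_name: str) -> Path:
    """Copy a source file into the lake untouched, partitioned by ingest date.

    No parsing or cleansing happens on the way in; a data lake defers that
    work to read time ("schema on read").
    """
    partition = LAKE_ROOT / source_name / f"ingest_date={date.today():%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / source_file.name
    shutil.copy2(source_file, target)  # byte-for-byte copy, no transformation
    return target

# Example: land a hypothetical CRM extract.
# land_raw_file(Path("crm_extract.csv"), "crm")
```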

But, in many ways, because of the perceived lack of governance and security, it's still like the wild west in the Hadoop world, where gunslingers and tumbleweeds abound. In fact, the same Gartner report identified the top challenges to big data adoption as:

  • Deriving business value.
  • Security and governance.
  • Data integration.
  • Skills.
  • Integrating with existing infrastructure.

Consequently, customers struggle with three main issues: 1) how to start their big data initiatives, 2) how to build out the infrastructure, and 3) how to run, manage, and scale out their solution.

Stability and predictability: The alternative selling points for your data quality vision?

One thing that always puzzled me when starting out with data quality management was just how difficult it was to obtain management buy-in. I've spoken before on this blog about times when I've witnessed considerable financial losses attributed to poor data quality being met with little more than a shrug of management's shoulders.

So how can you sell data quality to senior management?

The typical approach is to focus on benefits such as cost reduction, profit increase, faster projects, regulatory compliance, customer satisfaction and various operational efficiencies. But what if none of these are motivating your management team? What else can you try? What else motivates the average manager?
