Many people perceive big data management technologies as a “cure-all” for their analytics needs. But I would be surprised if any organization that has invested in developing a conventional data warehouse – even on a small scale – would completely rip that data warehouse out and immediately replace it with a NoSQL
In my prior two posts, I explored some of the issues associated with data integration for big data – particularly the conceptual data lake, in which source data sets are accumulated and stored, awaiting access from interested data consumers. One of the distinctive features of this approach is the transition
In my last post, I noted that the flexibility of the schema-on-read paradigm that is typical of a data lake had to be tempered with the use of a metadata repository, so that anyone wanting to use that data could figure out what was really in
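To make the schema-on-read idea concrete, here is a minimal sketch in Python; the feed name, field names, and the `catalog` structure are all hypothetical, not drawn from any particular tool. The point is that the lake stores the raw record untyped, and a schema drawn from a metadata repository is applied only when the data is read:

```python
import json
from datetime import datetime

# Hypothetical metadata repository entry describing how to interpret one raw feed.
catalog = {
    "sales_feed_v1": {
        "order_id": int,
        "amount": float,
        "order_date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
    }
}

def read_with_schema(raw_line, schema_name):
    """Schema-on-read: the lake stores raw, untyped JSON; types are applied
    only at the moment a consumer reads the data."""
    schema = catalog[schema_name]
    raw = json.loads(raw_line)
    return {field: cast(raw[field]) for field, cast in schema.items()}

raw = '{"order_id": "17", "amount": "12.50", "order_date": "2015-03-01"}'
print(read_with_schema(raw, "sales_feed_v1"))
```

The same raw bytes could be read under a different catalog entry by a different consumer – which is exactly why the metadata repository matters.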
A few of our clients are exploring the use of a data lake as both a landing pad and a repository for collections of enterprise data sets. However, after probing a little bit into what they expected to do with this data lake, I found that the simple use of
Operationalizing data governance means putting processes and tools in place for defining, enforcing and reporting on compliance with data quality and validation standards. There is a life cycle associated with a data policy, which is typically motivated by an externally mandated business policy or expectation, such as regulatory compliance.
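As a minimal sketch of what “defining, enforcing and reporting” might look like in code (the policy names, fields, and thresholds here are invented for illustration, not taken from any particular governance tool):

```python
# Define: each data policy is a named predicate over a record.
policies = {
    "ssn_present": lambda r: bool(r.get("ssn")),
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
}

def enforce_and_report(records):
    """Enforce each policy against every record, then report compliance per rule."""
    counts = {name: 0 for name in policies}
    for record in records:
        for name, rule in policies.items():
            if rule(record):
                counts[name] += 1
    total = max(len(records), 1)
    return {name: "{:.1f}% compliant".format(100.0 * n / total)
            for name, n in counts.items()}

print(enforce_and_report([
    {"ssn": "123-45-6789", "age": 34},   # passes both policies
    {"ssn": "", "age": 150},             # fails both policies
]))
```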
In recent years, we practitioners in the data management world have been pretty quick to conflate “data governance” with “data quality” and “metadata.” Many tools marketed under “data governance” have emerged – yet when you inspect their capabilities, you see that in many ways these tools largely encompass data validation and data standardization. Unfortunately, we
In my last two posts, I introduced some opportunities that arise from integrating event stream processing (ESP) within the nodes of a distributed network. We considered one type of deployment that includes the emergent Internet of Things (IoT) model in which there are numerous end nodes that monitor a set of sensors,
In my last post, we examined the growing importance of event stream processing to predictive and prescriptive analytics. In the example we discussed, we looked at how the event streams from point-of-sale systems at multiple retail locations are absorbed at a centralized point for analysis. Yet the beneficiaries of those
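A toy sketch of that centralized pattern (the store IDs and event shape are invented for illustration): events from many locations arrive interleaved on one stream, and the central node folds them into running aggregates.

```python
from collections import defaultdict

def absorb(stream):
    """Centralized absorption: fold interleaved point-of-sale events from many
    locations into running per-store totals."""
    totals = defaultdict(float)
    for store_id, amount in stream:
        totals[store_id] += amount
    return dict(totals)

# Events from multiple retail locations, in arrival order.
events = [("store_7", 12.50), ("store_3", 4.25), ("store_7", 8.00)]
print(absorb(events))  # {'store_7': 20.5, 'store_3': 4.25}
```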
Over the past year and a half, there has been a subtle shift in media attention from big data analytics to what is referred to as the Internet of Things, or IoT for short. The shift in focus is not intended to diminish the value of big data platforms and
Once you have assessed the types of reporting and analytics projects and activities to be done by the community of data analysts and consumers, and have assessed their business needs and requirements for performance, you can then evaluate – with confidence – how different platforms and tools can be combined to satisfy
In the last few days, I have heard the term “data lake” bandied about in various client conversations. As with all buzz-term simplifications, the concept of a “data lake” seems appealing, particularly when it is implied to mean “a framework enabling general data accessibility for enterprise information assets.” And of
As part of two of our client engagements, we have been tasked with providing guidance on an analytics environment platform strategy. More concretely, the goal is to assess the systems that currently compose the “data warehouse environment” and identify the considerations for selecting the optimal platforms to support
In my last two posts, we concluded two things. First, because of the need for broadcasting data across the internal network to enable the complete execution of a JOIN query in Hadoop, there is a potential for performance degradation for JOINs on top of files distributed using HDFS. Second, there are
In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed
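One common mitigation for that broadcast problem is the map-side (broadcast) join: replicate the small table to every node so the large table never has to cross the network. Below is a minimal single-process sketch of the idea in Python – the table names and contents are invented, and this only mimics what a distributed framework would do when it ships the small table to each node:

```python
def broadcast_join(small_table, large_table, key):
    """Map-side join: hash the small table (the piece a framework would ship
    to every node) and stream the large table past it locally, so the large
    table never crosses the network."""
    lookup = {row[key]: row for row in small_table}
    for row in large_table:
        match = lookup.get(row[key])
        if match is not None:
            yield {**match, **row}

stores = [{"store_id": 3, "region": "east"}]   # small dimension table
sales = [{"store_id": 3, "amount": 4.25},      # large fact table
         {"store_id": 9, "amount": 1.10}]
print(list(broadcast_join(stores, sales, "store_id")))
```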
Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics. However, it is still not necessarily clear that Hadoop is always the optimal choice for traditional data warehousing for reporting and analysis, especially in its “out of the box” configuration. That is because Hadoop itself is not
Over my last two posts, I suggested that our expectations for data quality morph over the duration of business processes, and it is only once the process has completed that we can demand that all statically applied data quality rules be observed. However, over the duration of the
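A small sketch of that distinction (the order fields, statuses, and the rule itself are hypothetical): a completeness rule that would fail mid-process is asserted only once the record reaches its terminal state.

```python
def check_completeness(order: dict) -> bool:
    """Static rule: a shipped order must carry a tracking number."""
    return bool(order.get("tracking_number"))

def validate(order: dict) -> bool:
    # While the process is still in flight, the rule is deliberately not
    # asserted; it becomes a hard requirement only at the terminal state.
    if order.get("status") != "shipped":
        return True
    return check_completeness(order)

print(validate({"status": "picking"}))                            # True: in flight
print(validate({"status": "shipped"}))                            # False: rule now applies
print(validate({"status": "shipped", "tracking_number": "1Z99"})) # True
```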
In my last post, I pointed out that we data quality practitioners want to apply data quality assertions to data instances to validate data in process, but the dynamic nature of data must be contrasted with our assumptions about how quality measures are applied to static records. In practice, the
After working in the data quality industry for a number of years, I have realized that most practitioners tend to have a rather rigid perception of the assertions about the quality of data. Either a data set conforms to the set of data quality criteria and is deemed to be acceptable
With our recent client engagements in which the organization is implementing one or more master data management (MDM) projects, I have been advocating that a task to design a demonstration application be added to the early part of the project plan. Many early MDM implementers seem to have taken the
In the last post we looked at the use case for master data in which the consuming application expected a single unique representative record for each unique entity. This is valuable for batch access situations, like SQL queries, where aggregates are associated with one and only one entity record.
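A minimal sketch of that expectation (the survivorship rule here – most recently updated record wins – is just an illustrative choice, as are the field names): collapse duplicates to one representative per entity so a downstream aggregate attributes each value to exactly one record.

```python
def golden_records(records, entity_key="customer_id", recency_key="updated"):
    """Collapse duplicates to one representative record per entity; here the
    most recently updated record survives (a deliberately simple rule)."""
    best = {}
    for record in records:
        key = record[entity_key]
        if key not in best or record[recency_key] > best[key][recency_key]:
            best[key] = record
    return list(best.values())

records = [
    {"customer_id": 1, "updated": "2015-01-10", "city": "Boston"},
    {"customer_id": 1, "updated": "2015-03-02", "city": "Cambridge"},
    {"customer_id": 2, "updated": "2015-02-14", "city": "Dayton"},
]
# Downstream aggregates now associate with one and only one record per entity.
print(golden_records(records))
```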
Last time I suggested that there are some typical use cases for master data, and this week we will examine the desire for accessibility to a presumed “golden” record that represents “the single source of truth” for a specific entity. I put both of those terms in quotes because I
I have probably touched on this topic many times before: accessing the data that has been loaded into a master data environment. In recent weeks, some client experiences have really highlighted something that is increasingly apparent (and should be obvious) for master data management: the need to demonstrate that it
A few weeks back I noted that one of the objectives of an inventory process for reference data was data harmonization, which meant determining when two reference sets refer to the same conceptual domain and harmonizing the contents into a conformed standard domain. Conceptually it sounds relatively straightforward, but as
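As a tiny illustration of the mechanics (the source systems and code values are invented): two reference sets that encode the same conceptual domain differently are mapped onto one conformed standard.

```python
# Two source systems encode the same conceptual domain (gender) differently:
# "HR" uses {"M", "F", "U"}, while "CRM" uses {"1", "2", "0"}.
# Harmonization maps each (source, value) pair onto one conformed standard domain.
conformed = {
    ("HR", "M"): "MALE", ("HR", "F"): "FEMALE", ("HR", "U"): "UNKNOWN",
    ("CRM", "1"): "MALE", ("CRM", "2"): "FEMALE", ("CRM", "0"): "UNKNOWN",
}

def harmonize(source: str, value: str) -> str:
    """Translate a source-specific code into the conformed standard value."""
    return conformed[(source, value)]

# The same real-world value, arriving under different encodings, now conforms.
print(harmonize("HR", "F"), harmonize("CRM", "2"))  # FEMALE FEMALE
```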
In my last set of posts I started to look at some of the challenges associated with enterprise management of reference data domains, especially as the scope of use for the same conceptual reference domains expands across databases, systems, and functional areas within the organization. Recognizing the value of capturing
David Loshin defines reference data and sets up a working definition for his next set of posts.
A few years ago, I was presenting a morning course on master data management in which I shared some thoughts about the barriers to success in transitioning the use of a developed master data management index and repository into production systems. During the coffee break, an attendee mentioned
In the past few weeks I have presented training sessions on data governance, master data management, data quality and analytics at three different venues. At each of these events, during one of the breaks, several people in my course noted that the technical concepts of implementing programs
In my last post I introduced the term “behavior architecture,” and this time I would like to explore what that concept means. One approach is to start with the basics: given a business process with a set of decision points and a number of participants, the behavior architecture is the
Instituting an analytics program in which actionable insight is delivered to business consumers will be successful only if those consumers are aware of what they need to do to improve their processes and reap the benefits. As we have explored over the past few posts, success in the use of
The data quality and data governance community has a somewhat disconcerting habit of appending the word “quality” to every phrase that has the word “data” in it. So it is no surprise that the growing use of the phrase “big data” has been duly followed by claims of