Data quality to "DI" for

There is a time and a place for everything, but no one ever seems quite sure of the time and place for data quality (DQ) in data integration (DI) efforts. I have previously blogged about the dangers of waiting until the middle of a DI effort to consider, or be forced to consider, DQ. In hindsight, telling yourself you should have had forethought about DQ amounts to little more than someone telling you "I told you so." Of course, many a DQ professional, myself included, has told you so. But I digress.


Data integration: Comparing traditional sources and big data

While not on the same level as Rush, I do fancy myself a fan of The Who. I'm particularly fond of the band's 1973 epic, Quadrophenia. From the track "5:15":

Inside outside, leave me alone
Inside outside, nowhere is home
Inside outside, where have I been?

The inside-outside distinction is rather apropos when thinking about traditional data integration versus its newer, bigger, more dynamic counterpart.



Data integration considerations for the data lake: Standardization and transformation

In my last post, I noted that the flexibility of the schema-on-read paradigm typical of a data lake has to be tempered with a metadata repository, so that anyone wanting to use the data can figure out what is really in it. The schema-on-read approach to data integration has a few other implications as well.
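To make the schema-on-read idea concrete, here is a minimal sketch (all names and data are hypothetical): raw records land in the lake untyped, a metadata repository entry records what the file contains, and the schema is applied only when the data is read.

```python
import csv
import io

# Raw records land in the lake as-is; nothing is enforced on write.
raw = "2024-01-05,order-1001,19.99\n2024-01-06,order-1002,7.50\n"

# A (hypothetical) metadata repository entry describing what the file holds,
# so consumers can apply the schema at read time.
schema = {"columns": ["order_date", "order_id", "amount"],
          "types": [str, str, float]}

def read_with_schema(raw_text, schema):
    """Apply the recorded schema when the data is read, not when it is stored."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value)
               for name, cast, value in zip(schema["columns"],
                                            schema["types"], row)}

records = list(read_with_schema(raw, schema))
print(records[0]["amount"] + records[1]["amount"])  # typed only at read time
```

Without the metadata entry describing the columns and types, a consumer of this file would be left guessing, which is exactly the risk the repository mitigates.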


Data governance and analytics

The intersection of data governance and analytics doesn't seem to get discussed as often as its intersection with data management, where data governance provides the guiding principles and context-specific policies that frame the processes and procedures of data management. The reason is not, as some may want to believe, that governing analytics is akin to herding cats. Data governance does, in fact, intersect all types of analytics: the descriptive analytics that helps the organization understand what has happened and what is happening now, the predictive analytics that determines the probability of what will happen next, and the prescriptive analytics that focuses on finding the best course of action for predicted future scenarios.


Big data integration: The case against an "all-in" approach

I've spent a great deal of time in my consulting career railing against multiple systems of record, data silos and disparate versions of the truth. In the mid-1990s, I realized that Excel could only do so much. To quickly identify and ultimately ameliorate thorny data issues, I had to up my game. I became proficient at SQL, Microsoft Access, Crystal Reports and other reporting tools precisely because much of my clients' data was incredibly messy. If I could solve these problems, then I could keep myself billable.

I'll still argue for the benefits of single data sources (where possible) until I'm blue in the face. Here, I'm talking about small data controlled by the enterprise. But what about integration with big data? Does a single data repository make sense? Should an organization go all-in?


Data integration considerations for the data lake: The need for metadata

A few of our clients are exploring the use of a data lake as both a landing pad and a repository for collections of enterprise data sets. However, after probing a little about what they expected to do with this data lake, I found that the simple use of the data lake appellation masks much of the complexity of the change required to transition from the conventional approach to data acquisition and storage to the data lake concept.


Justifying the need for a data governance business case

Many people who plan data governance initiatives ignore the need for a business case.

"We've already had approval for the project; why do we need a business case when we've got the budget signed off?"

The perception is that because they have a strong commitment, there is no need to get bogged down in the process of justification.


What is big data integration?

It's an important and profound question, one that executives will increasingly be asking themselves in the coming years. What's more, much like the definition of big data, it's folly to think of "one" or "the right" definition of the term "big data integration."

Here's mine.



Data quality on Hadoop: The easy way

Bigger doesn’t always mean better. And that’s often the case with big data. Your data quality (DQ) problems – no denial, please – often only magnify with bigger data sets.

Having more unstructured data adds another level of complexity. The need for data quality on Hadoop shows up in user feedback in the latest TDWI Best Practices Report, "Hadoop for the Enterprise": 55% of respondents plan to integrate DQ in the next three years. So how do you take care of big data quality?

Well, there is good news. You don’t have to learn Java and implement complex MapReduce code to fix quality issues in the data within a Hadoop cluster. SAS Data Loader for Hadoop comes with data quality directives that help business users detect and repair data problems quickly and easily.
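To illustrate the kind of work such directives automate – this is a toy sketch in plain Python, not SAS Data Loader's actual API, and all the data and variant mappings are made up – consider profiling a column for missing values and standardizing inconsistent codes:

```python
# Hypothetical messy records, the kind a DQ directive would profile and repair.
rows = [
    {"id": 1, "state": "ny"},
    {"id": 2, "state": "N.Y."},
    {"id": 3, "state": None},
    {"id": 4, "state": "NY"},
]

# Profiling step: count missing values in the column.
missing = sum(1 for r in rows if not r["state"])

# Standardization step: map known variants to one canonical code.
variants = {"ny": "NY", "n.y.": "NY", "new york": "NY"}
for r in rows:
    if r["state"]:
        r["state"] = variants.get(r["state"].strip().lower(), r["state"])

print(missing, [r["state"] for r in rows])
```

The point of the product's directives is that a business user gets this profiling and standardization through a guided interface, rather than hand-writing logic like the above (let alone MapReduce code) for every column.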

[Figure: SAS Data Loader for Hadoop, including the data quality directives]

