Inside outside, leave me alone
Inside outside, nowhere is home
Inside outside, where have I been?
The inside-outside distinction is rather apropos when thinking about traditional data integration versus its newer, bigger, more dynamic counterpart.
In my last post, I noted that the flexibility of the schema-on-read paradigm typical of a data lake had to be tempered with a metadata repository, so that anyone wanting to use the data could figure out what was really in it. Integrating data using the schema-on-read approach has a few other implications. Read More
The intersection of data governance and analytics doesn’t seem to get discussed as often as its intersection with data management, where data governance provides the guiding principles and context-specific policies that frame the processes and procedures of data management. The reason for this is not, as some may want to believe, that governing analytics is akin to herding cats. Data governance does, in fact, intersect all types of analytics, from the descriptive analytics helping the organization understand what has happened and what is happening now, to the predictive analytics that determines the probability of what will happen next, and the prescriptive analytics that focuses on finding the best course of action for predicted future scenarios. Read More
I've spent a great deal of time in my consulting career railing against multiple systems of record, data silos and disparate versions of the truth. In the mid-1990s, I realized that Excel could only do so much. To quickly identify and ultimately ameliorate thorny data issues, I had to up my game. I became proficient at SQL, Microsoft Access, Crystal Reports and other reporting tools precisely because much of my clients' data was incredibly messy. If I could solve these problems, then I could keep myself billable.
I'll still argue for the benefits of single data sources (where possible) until I'm blue in the face. Here, I'm talking about small data controlled by the enterprise. But what about integration with big data? Does a single data repository make sense? Should an organization go all-in? Read More
A few of our clients are exploring the use of a data lake as both a landing pad and a repository for collecting enterprise data sets. However, after probing a little about what they expected to do with this data lake, I found that the simple use of the data lake appellation masks much of the complexity of the change required to transition from the conventional approach of data acquisition and storage to the data lake concept. Read More
Many people who plan data governance initiatives ignore the need for a business case.
"We've already had approval for the project; why do we need a business case when we've got the budget signed off?"
The perception is that because they have a strong commitment, there is no need to get bogged down in the process of justification. Read More
It's an important and profound question, one that executives will increasingly be asking themselves in the coming years. What's more, much like the definition of big data, it's folly to think of "one" or "the right" definition of the term "big data integration."
Bigger doesn’t always mean better. And that’s often the case with big data. Your data quality (DQ) problem – no denial, please – often only magnifies when you get bigger data sets.
Having more unstructured data adds another level of complexity. The need for data quality on Hadoop is shown by user feedback in the latest TDWI Best Practices Report "Hadoop for the Enterprise": 55% of the respondents plan to integrate DQ in the next three years. So how do you take care of big data quality?
Well, there is good news. You don’t have to learn Java and implement complex MapReduce code to fix quality issues in the data within a Hadoop cluster. SAS Data Loader for Hadoop comes with data quality directives that help business users detect and repair data problems quickly and easily.
Yes. For those keeping score at home, this is my second post in a row starting with a one-word answer to its questioning title. In this case, it’s a question that’s asked a lot, and for good reason, since big data raises big questions for all data-related disciplines. Read More