As the application stack supporting big data has matured, it has demonstrated the feasibility of ingesting, persisting and analyzing potentially massive data sets that originate both within and outside of conventional enterprise boundaries. But what does this mean from a data governance perspective?
Tag: data management for analytics
One aspect of high-quality information is consistency. We often think about consistency in terms of consistent values. A large portion of the effort expended on “data quality dimensions” essentially focuses on data value consistency. For example, when we describe accuracy, what we often mean is consistency with a defined source
.@philsimon on the need to adopt agile methodologies for data prep and analytics.
In Part 1 of this two-part series, I defined data preparation and data wrangling, then raised some questions about requirements gathering in a governed environment (i.e., ODS and/or data warehouse). Now – all of us very-managed people are looking at the horizon, and we see the data lake. How do
Lately I've been binge-watching a lot of police procedural television shows. The standard format for almost every episode is the same. It starts with the commission or discovery of a crime, followed by forensic investigation of the crime scene, analysis of the collected evidence, and interviews or interrogations with potential suspects. It ends
.@philsimon chimes in on new data-gathering methods and what they mean for analytics.
I'm a very fortunate woman. I have the privilege of working with some of the brightest people in the industry. But when it comes to data, everyone takes sides. Do you “govern” the use of all data, or do you let the analysts do what they want with the data to
Critical business applications depend on the enterprise creating and maintaining high-quality data. So, whenever new data is received – especially from a new source – it’s great when that source can provide data without defects or other data quality issues. The recent rise in self-service data preparation options has definitely improved the quality of
Hadoop has driven an enormous amount of data analytics activity lately. And this poses a problem for many practitioners coming from the traditional relational database management system (RDBMS) world. Hadoop is well known for having lots of variety in the structure of data it stores and processes. But it's fair to
In my last post, I talked about how data still needs to be cleaned up – and data strategy still needs to be re-evaluated – as we start to work with nontraditional databases and other new technologies. There are lots of ways to use these new platforms (like Hadoop). For example, many
I'm hard-pressed to think of a trendier yet more amorphous term today than analytics. It seems that every organization wants to take advantage of analytics, but few really are doing that – at least to the extent possible. This topic interests me quite a bit, and I hope to explore
What does it really mean when we talk about the concept of a data asset? For the purposes of this discussion, let's say that a data asset is a manifestation of information that can be monetized. In my last post we explored how bringing many data artifacts together in a
If your enterprise is working with Hadoop, MongoDB or other nontraditional databases, then you need to evaluate your data strategy. A data strategy must adapt to current data trends based on business requirements. So am I still the clean-up woman? The answer is YES! I still work on the quality of the data.
The demand for data preparation solutions is at an all-time high, and it's primarily driven by the demand for self-service analytics. Ten years ago, if you were a business leader who wanted to get more in-depth information on a particular KPI, you would typically issue a reporting request to IT
Data access and data privacy are often fundamentally at odds with each other. Organizations want unfettered access to the data describing customers. Meanwhile, customers want their data – especially their personally identifiable information – to remain as private as possible. Organizations need to protect data privacy by only granting data access to authorized
A long time ago, I worked for a company that had positioned itself as basically a third-party “data trust” to perform collaborative analytics. The business proposition was to engage different types of organizations whose customer bases overlapped, ingest their data sets, and perform a number of analyses using the accumulated
In my previous post I discussed the practice of putting data quality processes as close to data sources as possible. Historically this meant data quality happened during data integration in preparation for loading quality data into an enterprise data warehouse (EDW) or a master data management (MDM) hub. Nowadays, however, there’s a lot of
Throughout my long career of building and implementing data quality processes, I've consistently been told that data quality could not be implemented within data sources, because doing so would disrupt production systems. Therefore, source data was often copied to a central location – a staging area – where it was cleansed, transformed, unduplicated, restructured
A soccer fairy tale: Imagine it's Soccer Saturday. You've got 10 kids and 10 loads of laundry – along with buried soccer jerseys – that you need to clean before the games begin. Oh, and you have two hours to do this. Fear not! You are a member of an advanced HOA
Traditional data governance is all about establishing a boundary around a specific data domain. This translates to establishing authority to define key business terms within that domain; establishing business-driven decision-making processes for changing the business terminology and the rules that apply to it; defining content standards (e.g., metadata and
(Otherwise known as Truncate – Load – Analyze – Repeat!) After you’ve prepared data for analysis and then analyzed it, how do you complete this process again? And again? And again? Most analytical applications are created to truncate the prior data, load new data for analysis, analyze it and repeat
Once you have assessed the types of reporting and analytics projects and activities that are to be done by the community of data analysts and consumers, and have assessed their business needs and requirements for performance, you can then evaluate – with confidence – how different platforms and tools can be combined to satisfy
In my previous post I used junk drawers as an example of the downside of including more data in our analytics just in case it helps us discover more insights only to end up with more flotsam than findings. In this post I want to float some thoughts about a two-word concept
In April, the free trial of SAS Data Loader for Hadoop became available globally. Now, you can take a test drive of our new technology designed to increase the speed and ease of managing data within Hadoop. The downloads might take a while (after all, this is big data), but I think you’ll
In the last post, we talked about creating the requirements for the data analytics, and profiling the data prior to load. Now, let’s consider how to filter, format and deliver that data to the analytics application. Filter – the act of selecting the data of interest to be used in the
In the era of big data, we collect, prepare, manage, and analyze a lot of data that is supposed to provide us with a better picture of our customers, partners, products, and services. These vast data murals are impressive to behold, but in painting such a broad canvas, these pictures
One area that often gets overlooked when building out a new data analytics solution is the importance of ensuring accurate and robust data definitions. This is one of those issues that is difficult to detect because unlike a data quality defect, there are no alarms or reports to indicate a