David Loshin says simple approaches to identity resolution may not scale on a big data platform as data volumes increase.
David Loshin explains four struggles of syndicating master data across the enterprise.
David Loshin explains why MDM is such a valuable tool in helping to detect fraud.
David Loshin extends his exploration of ethical issues surrounding automated systems and event stream processing to encompass data quality and risk considerations.
David Loshin describes three sets of policies required for ensuring compliance with data protection directives for health care.
Health care fraud prevention is a sticky topic. David Loshin discusses what's needed to balance prompt claims payments with fraud prevention efforts.
I've been working on a pilot project recently with a client to test out some new NoSQL database frameworks (graph databases in particular). Our goal is to see how a different storage model, representation and presentation can enhance the usability and ease of integration for master data indexes and entity
As the application stack supporting big data has matured, it has demonstrated the feasibility of ingesting, persisting and analyzing potentially massive data sets that originate both within and outside of conventional enterprise boundaries. But what does this mean from a data governance perspective?
In my last post we started looking at the issue of identifier proliferation, in which different business applications assigned their own unique identifiers to data representing the same entities. Even master data management (MDM) applications are not immune to this issue, particularly because of the inherent semantics associated with the
I was surprised to learn recently that despite the reams of laws and policies directing the protection of personally identifiable information (PII) across industries and government agencies, more than 50 million Medicare beneficiaries were issued cards with a Medicare Beneficiary Number that's based on their Social Security Number (SSN). That's
One aspect of high-quality information is consistency. We often think about consistency in terms of consistent values. A large portion of the effort expended on “data quality dimensions” essentially focuses on data value consistency. For example, when we describe accuracy, what we often mean is consistency with a defined source
We often talk about full customer data visibility and the need for a “golden record” that provides a 360-degree view of the customer to enhance our customer-facing processes. The rationale is that by accumulating all the data about a customer (or, for that matter, any entity of interest) from multiple sources, you
In my prior posts about operational data governance, I've suggested the need to embed data validation as an integral component of any data integration application. In my last post, we looked at an example of using a data quality audit report to ensure fidelity of the data integration processes for
Data integration teams often find themselves in the middle of discussions where the quality of their data outputs is called into question. Without proper governance procedures in place, though, it's hard to address these accusations in a reasonable way. Here's why.
In my last post, we explored the operational facet of data governance and data stewardship. We focused on the challenges of providing a scalable way to assess incoming data sources, identify data quality rules and define enforceable data quality policies. As the number of acquired data sources increases, it becomes
Data governance can encompass a wide spectrum of practices, many of which are focused on the development, documentation, approval and deployment of policies associated with data management and utilization. I distinguish the facet of “operational” data governance from the fully encompassed practice to specifically focus on the operational tasks for
What does it really mean when we talk about the concept of a data asset? For the purposes of this discussion, let's say that a data asset is a manifestation of information that can be monetized. In my last post we explored how bringing many data artifacts together in a
A long time ago, I worked for a company that had positioned itself as basically a third-party “data trust” to perform collaborative analytics. The business proposition was to engage different types of organizations whose customer bases overlapped, ingest their data sets, and perform a number of analyses using the accumulated
In my last post, I started to look at the use of Hadoop in general and the data lake concept in particular as part of a plan for modernizing the data environment. There are surely benefits to the data lake, especially when it's deployed using a low-cost, scalable hardware platform.
More and more organizations are considering the use of maturing scalable computing environments like Hadoop as part of their enterprise data management, processing and analytics infrastructure. But there's a significant difference between the evaluation phase of technology adoption and its subsequent production phase. This seems apparent in terms of how organizations are
In my last post, I discussed the issue of temporal inconsistency for master data, when the records in the master repository are inconsistent with the source systems as a result of a time-based absence of synchronization. Periodic master data updates that pull data from systems without considering alignment with in-process
Master data management (MDM) provides methods for unifying data about important entities (such as “customer” or “product”) that are managed within independent systems. In most cases, there is some kind of customer data integration requirement for downstream reporting and analysis in support of a specific business objective – such as customer profiling for
In my last post we started to look at two different Internet of Things (IoT) paradigms. The first only involved streaming automatically generated data from machines (such as sensor data). The second combined human-generated and machine-generated data, such as social media updates that are automatically augmented with geo-tag data by
The concept of the Internet of Things (IoT) is used broadly to cover any organization of communication devices and methods, messages streaming from the device pool, data collected at a centralized point, and analysis used to exploit the combined data for business value. But this description hides the richness of
At a recent TDWI conference, I was strolling the exhibition floor when I noticed an interesting phenomenon. A surprising percentage of the exhibiting vendors fell into one of two product categories. One group was selling cloud-based or hosted data warehousing and/or analytics services. The other group was selling data integration products. Of
I've been doing some investigation into Apache Spark, and I'm particularly intrigued by the concept of the resilient distributed dataset, or RDD. According to the Apache Spark website, an RDD is “a fault-tolerant collection of elements that can be operated on in parallel.” Two aspects of the RDD are particularly
In my two prior posts, I discussed the process of developing a business justification for a data strategy and for assessing an organization's level of maturity with key data management processes and operational procedures. The business justification phase can be used to speculate about the future state of data management required