How to improve your data profiling performance

Data profiling is a core technique of data quality management and often the starting point for data projects. Because it’s a relatively simple technique to apply, it’s easy to overlook the more advanced tactics that can take your data profiling to the next level. I'd like to share some "profiling power tips" to help improve your data profiling skills.

Power tip: Segment your data

Profiling tools give you the ability to check for statistics such as uniqueness, completeness, value distribution, format distribution and a whole host of other metrics.
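
To make those metrics concrete, here is a minimal sketch of how the basic column-level statistics could be computed with pandas; the DataFrame and its phone column are purely illustrative and not taken from any real project.

```python
import pandas as pd

# Hypothetical customer extract; the column and values are illustrative.
df = pd.DataFrame({"phone": ["0161 555 0101", "0161 555 0101",
                             "07700 900123", None, "5550202"]})

col = df["phone"]
completeness = col.notna().mean()          # share of populated values
uniqueness = col.nunique() / col.count()   # distinct values per populated value
value_distribution = col.value_counts(dropna=False)

# Format distribution: reduce each value to a pattern (digits -> 9, letters -> A).
formats = (col.dropna()
              .str.replace(r"\d", "9", regex=True)
              .str.replace(r"[A-Za-z]", "A", regex=True))
format_distribution = formats.value_counts()

print(f"completeness={completeness:.0%}, uniqueness={uniqueness:.0%}")
print(format_distribution)
```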

The problem is that profiling is often carried out on an entire set of data, and this can seriously skew the findings.

For example, in one past assignment we analysed a utilities system that contained millions of equipment records. We were particularly interested in the power rating value and whether it was populated or not.

Profiling this attribute in isolation told us that the organisation had a great deal of work to do in order to improve the power rating value. When we segmented the data by criteria – such as active equipment that wasn’t facing decommission – we realised the picture wasn’t as bleak as the initial profiling stats led us to believe.

The lesson here is that it’s vital to split your records into distinct subsets so you can make greater sense of what is really happening with your data quality. You will often be bombarded with questions by business users and sponsors when you present your profiling findings, so make sure you’ve explored many angles with your analysis.
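
As a rough illustration of this power tip, the sketch below contrasts the whole-population completeness of a power rating attribute with its completeness per segment; the equipment extract and status values are hypothetical, not figures from the assignment described above.

```python
import pandas as pd

# Hypothetical equipment extract; column names and values are illustrative.
equipment = pd.DataFrame({
    "equipment_id": [101, 102, 103, 104, 105, 106],
    "status": ["active", "active", "active",
               "decommissioning", "decommissioning", "decommissioning"],
    "power_rating": [11.0, 7.5, None, None, None, 5.5],
})

# Whole-population view: completeness looks poor.
overall = equipment["power_rating"].notna().mean()
print(f"Overall power_rating completeness: {overall:.0%}")   # 50% in this toy extract

# Segmented view: active equipment (the records the business cares about)
# tells a different story to equipment facing decommissioning.
by_segment = (
    equipment.assign(populated=equipment["power_rating"].notna())
             .groupby("status")["populated"]
             .mean()
)
print(by_segment)
```

The segmentation criteria themselves should come from the business, but the mechanics stay the same: compute each profiling metric per subset rather than only for the whole table.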

Direct data monetization

With respect to data, there seem to be a few types of companies:

  • Those that do fairly little to realize the value of their data. I've consulted for quite a few of them.
  • Those that maximize the value of their data, often controversially. Facebook and Google are squarely in this group.
  • Those that maximize the value of their data behind the scenes. Acxiom is perhaps the prime example.

Of course, these are generalizations. Many companies fall somewhere in between. As the data deluge continues, we're beginning to see new types of business-model experimentation emerging, particularly with directly monetizing data. As the Chinese say, in crisis there is opportunity.

This is why I found the recent ProPublica announcement so interesting. It has launched a Data Store, much like Apple's App Store. (Like Angry Birds, maybe you'll be able to buy Angry Data one day?) The main difference: you can buy data, not apps. From the site:

In the Data Store, you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

Innovation needs contamination

In his book Where Good Ideas Come From: The Natural History of Innovation, Steven Johnson explained that “error is not simply a phase you have to suffer through on the way to genius. Error often creates a path that leads you out of your comfortable assumptions. Being right keeps you in place. Being wrong forces you to explore.”

Mistaking signal for noise

“The trouble with error,” the psychologist Kevin Dunbar noted, “is that we have a natural tendency to dismiss it.” Dunbar’s research found that many scientific experiments produced results that were genuinely unexpected, meaning that more than half of the data collected by scientists deviated significantly from what they had predicted they would find. Dunbar found that scientists tended to treat these surprising outcomes as the result of flaws in their experimental method, or a mechanical malfunction in their laboratory equipment, or an error in data processing. In other words, the scientists assumed the result was noise, not signal.

Even though poor data quality is rightfully regarded as bad, as Johnson explained, “paradigm shifts begin with anomalies in the data, when scientists find that their predictions keep turning out to be wrong.” In other words, what appears to be poor data quality – in this case in the data recording the result of an experiment – reveals the need to challenge the assumptions that the experiment was based on.

Being deliberately noisy

Johnson also cited the research of psychologist Charlan Nemeth, which included experiments that deliberately introduced noise into the decision-making process, and what she found ran directly counter to our intuitive assumptions about truth and error. According to Johnson, her research suggests “a paradoxical truth about innovation: good ideas are more likely to emerge in environments that contain a certain amount of noise and error. You would think that innovation would be more strongly correlated with accuracy, clarity, and focus. A good idea has to be correct on some basic level, and we value good ideas because they tend to have a high signal-to-noise ratio. But that doesn’t mean you want to cultivate those ideas in noise-free environments, because they end up being too sterile and predictable in their output. The best innovation labs are always a little contaminated.”

Master data application services

Last time we started to discuss the strategy for applications to transition to using master data services. At the top of our master data services stack, we have the external- or application-facing capabilities. But first, let’s review the lifecycle of data about entities, namely: creating a new entity record, reading a record for an existing entity and updating an entity’s record.

We would normally add “retirement” of an entity’s record as well, but instead we can fold that into a more general categorization of “transition” of entity data, which includes retirement or deletion as well as conflation of one entity record with another when the two records are determined to represent the same real-world entity.

The goal of providing application-level master data services is to augment existing capabilities with the benefits of MDM: unique identifiability, identity resolution, uniform identifiers that can be shared across different enterprise applications, improved data quality and standard representation of common reference information.

That being said, some of the services that can be exposed to the applications include:

  • Searching for an existing record for an entity and determining membership in the data set. For example, when there is an attempt to add a new customer into the data set, first check to see if the customer is already known.
  • Retrieving one or more existing records that may represent an entity. An example is when one or more master records match a provided set of identifying attributes. In that scenario, a process might be adjusted to let an actor determine if the data in any of the returned records is close enough to the provided identifying attributes to presume equivalence. A minimal sketch of both services follows.
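
As a minimal sketch of these two services, the code below uses an in-memory store, an exact-match search for the membership check and a crude attribute-overlap score for candidate retrieval; the class names, method names and scoring rule are illustrative assumptions rather than any particular MDM product's API.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class MasterRecord:
    master_id: str
    attributes: dict

@dataclass
class MatchCandidate:
    record: MasterRecord
    score: float   # crude overlap score; a real MDM hub would apply
                   # deterministic and/or probabilistic matching rules

class MasterDataService:
    def __init__(self):
        self._records: dict[str, MasterRecord] = {}

    def search(self, identifying_attrs: dict) -> Optional[MasterRecord]:
        """Membership check: return an exact match if the entity is already known."""
        for rec in self._records.values():
            if all(rec.attributes.get(k) == v for k, v in identifying_attrs.items()):
                return rec
        return None

    def retrieve_candidates(self, identifying_attrs: dict,
                            threshold: float = 0.5) -> list[MatchCandidate]:
        """Return possible matches so an actor (or rule) can decide equivalence."""
        candidates = []
        for rec in self._records.values():
            matched = sum(rec.attributes.get(k) == v
                          for k, v in identifying_attrs.items())
            score = matched / max(len(identifying_attrs), 1)
            if score >= threshold:
                candidates.append(MatchCandidate(rec, score))
        return sorted(candidates, key=lambda c: c.score, reverse=True)

    def create(self, attributes: dict) -> MasterRecord:
        """Add a new entity record once search and retrieval find no match."""
        rec = MasterRecord(master_id=str(uuid.uuid4()), attributes=attributes)
        self._records[rec.master_id] = rec
        return rec

# Typical flow: search first, review candidates, create only if nothing matches.
svc = MasterDataService()
svc.create({"name": "Jane Doe", "postcode": "M1 1AA"})
print(svc.search({"name": "Jane Doe", "postcode": "M1 1AA"}))
print(svc.retrieve_candidates({"name": "Jane Doe", "postcode": "M2 2BB"}))
```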

Leaders need to shine a light on their data

These days there is endless talk about data: how to use it, how to value it, where to get it, how to secure it and when to measure it. Data is pervasive, and it is beginning to influence our society with increasing impact and accelerating velocity. Let’s examine the effect the quality of data has on how organizations are perceived and valued (or in some cases devalued).

Large organizations are increasingly investing hundreds of millions of dollars in advertising and public relations. Take Super Bowl commercials and political lobbying, for example. In both cases, the usual intent is to build a brand and to influence decision makers.

Yet, I often wonder why the executives who are so focused on branding and building influence close their eyes to data quality problems and label them as part of the fabric of business life. It’s similar to how we have come to accept air pollution as the price to pay for living in a big city; the response to poor-quality data is often just a shrug of the shoulders.

While organizations are highly sensitive to social perceptions and overall brand management, suboptimal data can have a big effect on brand perceptions. The question, then, is why do many organizations underestimate its impact?

One reason is that the effects of poor data quality are often hidden in other issues. For example, when a long-time customer calls customer service to discuss problems with a recently purchased lawn mower, he is happy that the representative suggests checking the warranty status. But then he is quickly confused when he is asked for the model number. “Don’t you have that information reflected in your system? I’m a long-time member of your loyalty program.”

Employees on the front lines encounter suboptimal data in nearly every application, on an almost daily basis. After years with no discernible improvements, what other conclusion can employees draw than that data is not a valued component of the business?

Creating the data quality franchise

One of the growing trends I’m witnessing when talking to Data Quality Pro’s guest interviewees is the use of federated data quality tactics.

The idea is a simple but compelling one. Rather than having a large team that manages data quality across the organisation, you create satellite teams that adopt common frameworks, tools and techniques. These satellite teams coordinate their own initiatives locally.

One of our members was quizzing me on this approach recently and I likened it to the concept of a "data quality franchise."

Franchises are one of the most popular methods of creating a successful business, so I think this metaphor works well for federated data quality. It delivers:

  • An established operating model - Someone has already figured out the mechanics of running the business and all the intricate processes involved.

  • Startup support - Creating pockets of data quality capability around the organisation requires specialist knowledge to get off the ground. So, as with any franchise, this expertise can be drawn from a central team or centre of excellence.

  • Ongoing support - Once the initial foundations are in place, you may find that regular support is forthcoming from a central resource. The central team can also look at what has worked well for individual teams and share those innovations across all of them.

  • Pre-designed materials and assets - Starting a data quality initiative for the first time can be overwhelming, and one of the biggest challenges is understanding which dimensions and rules to use and which forms to complete. Having a central resource where these are created and shared out amongst the satellite teams removes a lot of the guesswork.

Better data through visualization

While we live in an era of big data, it's folly to claim that all data is accurate. Just because you read something on the internet doesn't make it true. In this post, I'll look at two organizations that are working to increase data accuracy and transparency.

I'll spare you my entire rant on the subject, but many organizations have a hard time determining who reports to whom. Astonishing. And, if they don't know themselves, it's often hard for the outside world to know, as well. (Note that this may be far from accidental. As I know all too well, certain organizations are opaque by design.)

That aside for the moment, Transparentrees seeks to fix this problem, at least in the corporate world. From its site:

Inaccurate information might just be the single greatest cause of inefficiency. Information can be old, incomplete, embellished, or untruthful. We think we’ve figured out a way to culture accurate information for every organization and business professional in existence. That’s our goal.

It's an interesting idea, although it's still in its infancy. (By no means is the site comprehensive.) Still, it shows promise as a way of making things, well, clearer. Case in point: Want to see Apple's current org chart? Here's a snapshot featuring Apple marketing head honcho Philip Schiller:

Lean against bias for accurate analytics

We sometimes describe the potential of big data analytics as letting the data tell its story, casting the data scientist as storyteller. While the journalist has long been a newscaster, in recent years the term data-driven journalism has been adopted to describe the process of using big data analytics to create a news story.

One of the concerns that Daniel Kahneman expressed during a recent interview about journalism is that its stories are too good. “The stories are oversimplified and exaggerate the coherence of the information. This is something that comes naturally for journalists to do but that’s also what the reading public demands. On both sides of this, there’s an eagerness to produce stories that are coherent and to hear stories that are coherent. So the stories are simpler than reality and in some ways better than the true stories.”

This two-sided challenge also exists within companies. Analytical teams are eager to produce data-driven stories that are coherent, and business leaders are eager to hear data-driven stories that are coherent. The distillation of complexity that occurs with big data analytics means that data-driven stories are always simpler than reality.

Engineering the master data services stack for application integration

In the last two series of posts, we have been discussing the challenges of integrating applications with a maturing master data management (MDM) repository and index, along with an approach that makes it easier for existing applications to incrementally adopt MDM. This approach involves developing a tiered architecture that is flexible enough to maintain a synchronous master repository and index. It also provides a variety of master data services spanning the different usage scenarios for shared access to a presentation of information about uniquely identifiable entities. The layers in this tiered architecture include:

  • A data layer for managing the physical storage medium for shared master data;
  • An access layer that can enable access to the data layer in a way that satisfies the load, response time and bandwidth requirements of multiple applications running simultaneously; and
  • An application services layer comprising the core services common to multiple applications and business processes, providing business-oriented capabilities that support production applications. A rough sketch of this layering follows.
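
As a rough sketch of how these layers might be stubbed out, the skeleton below wires an application services layer to an access layer that mediates every call to the data layer; the class and method names, and the in-memory storage, are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod

class DataLayer(ABC):
    """Data layer: manages the physical storage medium for shared master data."""
    @abstractmethod
    def read(self, master_id: str) -> dict: ...
    @abstractmethod
    def write(self, master_id: str, record: dict) -> None: ...

class InMemoryDataLayer(DataLayer):
    """Stand-in storage so the sketch can run end to end."""
    def __init__(self):
        self._store: dict[str, dict] = {}
    def read(self, master_id: str) -> dict:
        return self._store[master_id]
    def write(self, master_id: str, record: dict) -> None:
        self._store[master_id] = record

class AccessLayer:
    """Access layer: mediates every application's access to the data layer.
    Pooling, caching and throttling to meet load, response-time and
    bandwidth requirements would live here."""
    def __init__(self, data_layer: DataLayer):
        self._data = data_layer
    def get(self, master_id: str) -> dict:
        return self._data.read(master_id)
    def put(self, master_id: str, record: dict) -> None:
        self._data.write(master_id, record)

class ApplicationServicesLayer:
    """Application services layer: business-oriented services shared by
    multiple applications, built only on the access layer."""
    def __init__(self, access: AccessLayer):
        self._access = access
    def register_customer(self, master_id: str, attributes: dict) -> None:
        self._access.put(master_id, attributes)
    def customer_view(self, master_id: str) -> dict:
        return self._access.get(master_id)

# Wiring the stack together, bottom up.
services = ApplicationServicesLayer(AccessLayer(InMemoryDataLayer()))
services.register_customer("C-001", {"name": "Acme Ltd", "country": "GB"})
print(services.customer_view("C-001"))
```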

Struggling to get started with data quality? Start with data lineage

Many people don’t know where to start with data quality. They get bogged down with questions about dimensions, ownership, rules and tools. The problem can seem too vast for them to even begin making sense of their data landscape, let alone transform it into a well-governed, high-quality asset.

A lot of data quality initiatives start with one person who has the drive but may lack personnel to support them. Tools and technology may not be freely available until value has been demonstrated. I’ve been in this situation several times in my career and found the best place to start is always the lineage of your data, because it lays the perfect foundation for developing a data governance and data quality management framework everyone can buy into.

Data lineage is one of those techniques that anyone can learn to master overnight. It is more an exercise in persistence and stubbornness than a methodology-based technique.

The first step is to speak with your business users and identify which critical data elements drive the core business functions. For example, customer data, product data, billing data and contract data may be critical to your organisation.
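
Once those critical data elements are agreed, the lineage of each one can be captured in something as simple as the structure below; the systems, fields and transformations shown are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass
class LineageHop:
    source: str          # system or dataset the element comes from
    target: str          # system, process or report that consumes it
    transformation: str  # what happens to the element along the way

# Hypothetical lineage for one critical data element: customer address.
customer_address_lineage = [
    LineageHop("CRM.customer.address", "Staging.customer",
               "trim whitespace, standardise casing"),
    LineageHop("Staging.customer", "Billing.invoice_address",
               "postcode validation and enrichment"),
    LineageHop("Billing.invoice_address", "Reporting.customer_360",
               "passed through unchanged"),
]

def trace(lineage):
    """Print the element's journey from origin to final consumer."""
    for hop in lineage:
        print(f"{hop.source} -> {hop.target}  [{hop.transformation}]")

trace(customer_address_lineage)
```

Even a plain list like this, maintained for each critical data element, gives the governance and quality conversations that follow something concrete to anchor to.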
