Getting started with big data

"Most” organizations are embracing big data. For instance, a 2013 Gartner survey found that 64 percent of enterprises were deploying or planning big data projects, up from 58% the year before.

Those numbers simply don't fit with what I’m seeing, and I suspect that I'm hardly alone. (By way of background, I spent a great deal of time researching big data for my last two books.) I would wager that for every Amazon, Apple, Facebook, Twitter, Netflix and Google, there are thousands of mid-sized and large organizations that are doing very little or nothing with big data.

Why the lack of implementation? Many reasons come to mind, not the least of which is that the incessant noise around big data from social media intimidates and confuses people. The hype cycle is in full swing.

Big questions

Many CXOs are left not knowing where to begin. They are asking themselves questions such as:

  • Should we start small or large?
  • Is big data just another IT project that can be run by a unit head?
  • Or is it something so vast and nebulous that it needs to be fully backed by the people at the top of the organization?
  • If the former, should we rent data scientists on sites like Kaggle? If the latter, how can the entire organization embrace big data?

The dark side of the mood

As an unabashed lover of data, I am thrilled to be living and working in our increasingly data-constructed world. One new type of data analysis eliciting strong emotional reactions these days is sentiment analysis: the analysis of directly digitized customer feedback provided via online reviews, emails, voicemails, text messages and social networking status updates, where word of mouth has become word of data.

Sentiment analysis is often marketed as an essential component of understanding what your customers really think about your products and services. Although it can definitely be valuable, sentiment analysis can also suffer from what is known in psychology as negativity bias.

Customers, like all people, pay more attention to, and give more weight to, negative experiences. If you doubt the sentiment of that statement, I welcome you to compare being complimented with being insulted. Which are you more likely to remember? Which are you more likely to tell people about?

Yes, self-aggrandizing people often brag about the compliments they received, and even invent compliments they never received. This, too, skews sentiment analysis, because false praise and fake five-star reviews are a product of the dark side of social media marketing.
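To make the aggregation problem concrete, here is a minimal lexicon-based scoring sketch. The word lists and reviews are invented for illustration, and production sentiment tools use far richer models.

    # Minimal lexicon-based sentiment scorer -- the word lists and reviews
    # below are invented for illustration, not real data or a real tool.
    POSITIVE = {"great", "love", "excellent", "happy"}
    NEGATIVE = {"terrible", "hate", "broken", "awful"}

    def score(review: str) -> int:
        """Crude score: +1 per positive word, -1 per negative word."""
        words = review.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    reviews = [
        "Love this mower, great value",
        "Terrible support and the blade arrived broken",
        "Awful experience, I hate the new model",
    ]

    scores = [score(r) for r in reviews]
    print(scores)                      # [2, -2, -2]
    print(sum(scores) / len(scores))   # the average leans negative

Negativity bias means dissatisfied customers are more likely to write at all, so the sample leans negative before any scoring happens, while fake five-star reviews push the aggregate the other way; the scoring code cannot see either distortion.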


Master data manipulation services

To a great extent, the data manipulation layer of our multi-tiered master data services mimics the capabilities of the application services discussed in the previous posting. However, the value of segregating the actual data manipulation from the application-facing API is that the latter can be developed within the consuming application’s code base as a wrapper over its own data subsystem. That gives the application a façade that guarantees consistent results even while the API still targets the application’s own data layer, and it provides a “dual-tracked” means of selectively transitioning to the master services.

For example, an application might use the search mechanism to see if an entity record exists. At first, the application can do the lookup in its own data subsystem, then switch to the master entity lookup, which may provide better identity resolution.
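As a rough sketch of that dual-tracked lookup, assuming illustrative names rather than any particular product's API, the consuming application keeps calling a single façade method while a flag decides whether the lookup is served by its own data subsystem or by the master entity service.

    # Illustrative facade for entity lookup -- class and method names are hypothetical.
    class EntityLookupFacade:
        def __init__(self, local_store, master_service, use_master=False):
            self.local_store = local_store        # the application's own data subsystem
            self.master_service = master_service  # the shared master data service
            self.use_master = use_master          # flip this to transition selectively

        def find_entity(self, identifying_attributes: dict):
            """Return a matching entity record, or None if nothing matches."""
            if self.use_master:
                # The master lookup can apply identity resolution (standardized
                # identifiers, fuzzy matching) rather than an exact key match.
                return self.master_service.search(identifying_attributes)
            return self.local_store.lookup(identifying_attributes)

The application code only ever calls find_entity; flipping use_master is the selective transition described above, and the façade presents a consistent interface from the caller's point of view.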

The data manipulation layer of the service stack can include any or all of several types of services.


How to improve your data profiling performance

Data profiling is a core technique of data quality management and often the starting point for a data quality initiative. Because it’s a relatively simple technique to apply, it’s easy to overlook the more advanced practices that can take your profiling to the next level. I'd like to share some "profiling power tips" to help improve your data profiling skills.

Power tip: Segment your data

Profiling tools give you the ability to check for statistics such as uniqueness, completeness, value distribution, format distribution and a whole host of other metrics.
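As a rough sketch of those basic metrics, assuming the data can be pulled into a pandas DataFrame (the columns and values below are invented for illustration):

    import pandas as pd

    # Invented example records -- in practice this would be the table being profiled.
    df = pd.DataFrame({
        "equipment_id": ["EQ-001", "EQ-002", "EQ-003", "EQ-003"],
        "power_rating": [400.0, None, 250.0, 250.0],
    })

    for col in df.columns:
        print(col)
        print("  completeness:", df[col].notna().mean())         # share of populated values
        print("  uniqueness:  ", df[col].nunique(dropna=True))   # count of distinct values
        print("  value distribution:", df[col].value_counts().head().to_dict())
        # Crude format distribution: map letters to 'A' and digits to '9', then count patterns.
        formats = (df[col].dropna().astype(str)
                   .str.replace(r"[A-Za-z]", "A", regex=True)
                   .str.replace(r"[0-9]", "9", regex=True))
        print("  format distribution:", formats.value_counts().to_dict())

A dedicated profiling tool reports far more than this, but the same handful of metrics sits at the core.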

The problem is that profiling is often carried out on an entire set of data, which can seriously skew the findings.

For example, in one past assignment we analysed a utilities system that contained millions of equipment records. We were particularly interested in the power rating value and whether it was populated or not.

Profiling this attribute in isolation told us that the organisation had a great deal of work to do in order to improve the power rating value. When we segmented the data by criteria – such as active equipment that wasn’t facing decommission – we realised the picture wasn’t as bleak as the initial profiling stats led us to believe.
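Segmenting is usually just a filter applied before the same metrics are recalculated. Here is a rough sketch of the power rating example, with invented status values and data:

    import pandas as pd

    # Invented equipment data, now with a status flag (illustrative values only).
    df = pd.DataFrame({
        "equipment_id": ["EQ-001", "EQ-002", "EQ-003", "EQ-004"],
        "status":       ["active", "decommissioning", "active", "decommissioning"],
        "power_rating": [400.0, None, 250.0, None],
    })

    overall = df["power_rating"].notna().mean()
    active  = df.loc[df["status"] == "active", "power_rating"].notna().mean()

    print(f"power_rating completeness, all records:    {overall:.0%}")   # 50%
    print(f"power_rating completeness, active records: {active:.0%}")    # 100%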

The lesson here is that it’s vital to split your records into distinct subsets so you can make greater sense of what is really happening with your data quality. You will often be bombarded with questions by business users and sponsors when you present your profiling findings, so make sure you’ve explored many angles with your analysis.


Direct data monetization

With respect to data, there seem to be a few types of companies:

  • Those that do fairly little to realize the value of their data. I've consulted for quite a few.
  • Those that maximize the value of their data, often controversially. Facebook and Google are squarely in this group.
  • Those that maximize the value of their data behind the scenes. Acxiom is perhaps the prime example.

Of course, these are generalizations. Many companies fall somewhere in between. As the data deluge continues, we're beginning to see new types of business-model experimentation emerging, particularly with directly monetizing data. As the Chinese say, in crisis there is opportunity.

This is why I found the recent ProPublica announcement so interesting. It has launched a Data Store, much like Apple's App Store. (Like Angry Birds, maybe you'll be able to buy Angry Data one day?) The main difference: you can buy data, not apps. From the site:

In the Data Store, you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.


Innovation needs contamination

In his book Where Good Ideas Come From: The Natural History of Innovation, Steven Johnson explained that “error is not simply a phase you have to suffer through on the way to genius. Error often creates a path that leads you out of your comfortable assumptions. Being right keeps you in place. Being wrong forces you to explore.”

Mistaking signal for noise

“The trouble with error,” the psychologist Kevin Dunbar noted, “is that we have a natural tendency to dismiss it.” Dunbar’s research found that many scientific experiments produced results that were genuinely unexpected, meaning that more than half of the data collected by scientists deviated significantly from what they had predicted they would find. Dunbar found that scientists tended to treat these surprising outcomes as the result of flaws in their experimental method, a mechanical malfunction in their laboratory equipment, or an error in data processing. In other words, the scientists assumed the result was noise, not signal.

Even though poor data quality is rightfully regarded as bad, as Johnson explained, “paradigm shifts begin with anomalies in the data, when scientists find that their predictions keep turning out to be wrong.” In other words, what appears to be poor data quality – in this case in the data recording the result of an experiment – reveals the need to challenge the assumptions that the experiment was based on.

Being deliberately noisy

Johnson also cited the research of psychologist Charlan Nemeth, which included experiments that deliberately introduced noise into the decision-making process, and what she found ran directly counter to our intuitive assumptions about truth and error. According to Johnson, her research suggests “a paradoxical truth about innovation: good ideas are more likely to emerge in environments that contain a certain amount of noise and error. You would think that innovation would be more strongly correlated with accuracy, clarity, and focus. A good idea has to be correct on some basic level, and we value good ideas because they tend to have a high signal-to-noise ratio. But that doesn’t mean you want to cultivate those ideas in noise-free environments, because they end up being too sterile and predictable in their output. The best innovation labs are always a little contaminated.”


Master data application services

Last time we started to discuss the strategy for applications to transition to using master data services. At the top of our master data services stack, we have the external- or application-facing capabilities. But first, let’s review the lifecycle of data about entities, namely: creating a new entity record, reading a record for an existing entity and updating an entity’s record.

We would normally add “retirement” of an entity’s record as well, but instead we can fold that into a more general categorization of “transition” of entity data, which includes retirement or deletion as well as conflation of one entity record with another when the two records are determined to represent the same real-world entity.

The goal of providing application-level master data services is to augment existing capabilities with the benefits of MDM: unique identifiability, identity resolution, uniform identifiers that can be shared across different enterprise applications, improved data quality and standard representation of common reference information.

That being said, some of the services that can be exposed to the applications include:

  • Searching for an existing record for an entity and determining membership in the data set. For example, when there is an attempt to add a new customer into the data set, first check to see if the customer is already known.
  • Retrieving one or more existing records that may represent an entity. For example, one or more master records might match a provided set of identifying attributes. In that scenario, a process might be adjusted to let an actor determine whether the data in any of the returned records is close enough to the provided identifying attributes to presume equivalence. A minimal sketch of these services follows the list.
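To make those services concrete, here is a minimal sketch of what such an application-facing interface could look like; the class names, method signatures and the match-score threshold are illustrative assumptions, not a prescribed design.

    from dataclasses import dataclass

    @dataclass
    class MatchCandidate:
        record: dict    # the master record's attributes
        score: float    # similarity to the provided identifying attributes, 0..1

    class MasterEntityService:
        """Illustrative application-facing master data services -- a sketch, not a spec."""

        def search(self, identifying_attributes: dict) -> list[MatchCandidate]:
            """Return candidate master records ranked by match score.
            Identity resolution (exact and fuzzy matching) would live here."""
            raise NotImplementedError

        def exists(self, identifying_attributes: dict) -> bool:
            """Membership check: is this entity already known to the master data set?"""
            return len(self.search(identifying_attributes)) > 0

        def resolve(self, candidates: list[MatchCandidate], threshold: float = 0.9):
            """Auto-accept a candidate above the threshold; below it, return None so a
            human actor can decide whether the match is close enough to presume equivalence."""
            best = max(candidates, key=lambda c: c.score, default=None)
            return best.record if best and best.score >= threshold else None

An application adding a new customer would call exists() first, and fall back to a manual review step whenever resolve() returns None.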

Leaders need to shine a light on their data

These days there is endless talk about data: how to use it, how to value it, where to get it, how to secure it and when to measure it. Data is pervasive, and it is beginning to influence our society with increasing impact and accelerating velocity. Let’s examine the effect the quality of data has on how organizations are perceived and valued (or, in some cases, devalued).

Large organizations are increasingly investing hundreds of millions of dollars in advertising and public relations. Take Super Bowl commercials and political lobbying, for example. In both cases, the usual intent is to build a brand and to influence decision makers.

Yet, I often wonder why the executives who are so focused on branding and building influence close their eyes to data quality problems and label them as part of the fabric of business life. It’s similar to how we have come to accept air pollution as the price to pay for living in a big city; the response to poor-quality data is often just a shrug of the shoulders.

Organizations are highly sensitive to social perception and overall brand management, and suboptimal data can have a big effect on how a brand is perceived. Why, then, do so many organizations underestimate its impact?

One reason is that the effects of poor data quality are often hidden in other issues. For example, when a long-time customer calls customer service to discuss problems with a recently purchased lawn mower, he is happy that the representative suggests checking the warranty status. But then he is quickly confused when he is asked for the model number. “Don’t you have that information reflected in your system? I’m a long-time member of your loyalty program.”

Employees on the front lines encounter suboptimal data in nearly every application, on an almost daily basis. After years with no discernible improvements, what other conclusion can employees draw than that data is not a valued component of the business?


Creating the data quality franchise

One of the growing trends I’m witnessing when talking to Data Quality Pro’s guest interviewees is the use of federated data quality tactics.

The idea is a simple but compelling one. Rather than having a large team that manages data quality across the organisation, you create satellite teams that adopt common frameworks, tools and techniques. These satellite teams coordinate their own initiatives locally.

One of our members was quizzing me on this approach recently and I likened it to the concept of a "data quality franchise."

Franchises are one of the most popular methods of creating a successful business, so I think this metaphor works well for federated data quality. It delivers:

  • An established operating model - Someone has already figured out the mechanism of running the business and all the intricate processes involved.

  • Startup support - Creating pockets of data quality capability around the organisation requires specialist knowledge to get off the ground. So, as with any franchise, this expertise can be drawn from a central team or centre of excellence.

  • Ongoing support - Once the initial foundations are in place, you may find that regular support is forthcoming from a central resource. The central team can also look at what has worked well for individual teams and then share those innovations across all of them.

  • Pre-designed materials and assets - Starting a data quality initiative for the first time can be overwhelming, and one of the biggest challenges is understanding which dimensions to measure, which rules to apply and which forms to complete. Having a central resource where these are created and shared out amongst the various satellite teams removes a lot of the guesswork.


Better data through visualization

While we live in an era of big data, it's folly to claim that all data is accurate. Just because you read something on the internet doesn't make it true. In this post, I'll look at two organizations that are working to increase data accuracy and transparency.

I'll spare you my entire rant on the subject, but many organizations have a hard time determining who reports to whom. Astonishing. And, if they don't know themselves, it's often hard for the outside world to know, as well. (Note that this may be far from accidental. As I know all too well, certain organizations are opaque by design.)

That aside for the moment, Transparentrees seeks to fix this problem, at least in the corporate world. From its site:

Inaccurate information might just be the single greatest cause of inefficiency. Information can be old, incomplete, embellished, or untruthful. We think we’ve figured out a way to culture accurate information for every organization and business professional in existence. That’s our goal.

It's an interesting idea, although it's still in its infancy. (By no means is the site comprehensive.) Still, it shows promise as a way of making things, well, clearer. Case in point: Want to see Apple's current org chart? Here's a snapshot featuring Apple marketing head honcho Philip Schiller.
