Junk drawers and data analytics

In the era of big data, we collect, prepare, manage, and analyze a lot of data that is supposed to provide us with a better picture of our customers, partners, products, and services. These vast data murals are impressive to behold, but in painting on such a broad canvas they can end up as impressive works of art yet sketchy splotches of science, beset by statistical and systematic errors as well as sampling bias. And when taken out of context, the analysis of all this data may reveal only superficial correlations, not deep insights.

You might be able to learn something about me, for example, by analyzing the contents of the junk drawer in my kitchen. In it you would find not only a few dozen paper clips and rubber bands of various sizes, several dollars' worth of loose change, and coupons, most expired, for local restaurants and shops, but also a wide assortment of other items. Among them would be frequent flyer cards for just about every major airline and loyalty cards for just about every major hotel chain and car rental agency, player cards for half the casinos on the Las Vegas strip, three wristwatches, battery chargers for half a dozen mobile phones, instruction manuals for numerous small electronics, boxes of business cards from my last five jobs, and maps of local hiking and biking trails.

Accurate data definitions: The keystone to trusted data analytics?

One area that often gets overlooked when building out a new data analytics solution is the importance of ensuring accurate and robust data definitions.

This is one of those issues that is difficult to detect because, unlike a data quality defect, there are no alarms or reports to indicate a fault; the user simply draws an incorrect meaning from the analytics.

Let’s take a car dealership, for example. A new IT manager decides to invest in an analytics solution and provides a dashboard to executives. The CEO wants to know how well the business is growing, so they watch customer growth, which shows a steady increase over time.

The analytics system receives regular feeds of data from over 250 dealerships across the country, but many use their own customer management systems. This is where the problems begin.
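
To make that failure mode concrete, here is a minimal sketch of how two feeds that define "customer" differently can inflate a combined metric. The feed layouts, field names, and filter rule are assumptions invented for the illustration, not details from the dealership scenario.

```python
# Hypothetical illustration: two dealership feeds that disagree on what
# a "customer" is. Dealership A counts only people who bought a car;
# dealership B also counts prospects who merely asked about one.

feed_a = [  # every record here is an actual sale
    {"name": "Alice Ng", "record_type": "sale"},
    {"name": "Bob Tran", "record_type": "sale"},
]

feed_b = [  # mixes sales with prospects under the same "customer" label
    {"name": "Cara Diaz", "record_type": "sale"},
    {"name": "Dan Wolfe", "record_type": "test_drive"},
    {"name": "Eve Stone", "record_type": "inquiry"},
]

# Naive rollup: trust each feed's own notion of "customer".
naive_total = len(feed_a) + len(feed_b)  # 5 "customers"

# Rollup against one shared definition: a customer is someone who bought.
shared_total = sum(
    1 for rec in feed_a + feed_b if rec["record_type"] == "sale"
)  # 3 customers

print(naive_total, shared_total)  # 5 vs. 3 -- same data, different "growth"
```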

Data preparation: Managing data for analytics

What data do you prepare for analysis? Where does that data come from in the enterprise? Hopefully, by answering these questions, we can understand what is required to supply data to an analytics process.

Data preparation is the act of cleansing (or not) the data required to meet the business needs specified in the analytic requirements. Much like any other requirements document, the sources of the data must be identified. Then you have to identify the relevant data within those sources.
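
As a rough illustration of that two-step sequence, the sketch below first names the sources, then pulls and cleanses only the data the requirements call for. The file names, columns, and cleansing rules are all hypothetical.

```python
import pandas as pd

# Step 1: identify the sources named in the analytic requirements.
# (File names and columns are placeholders, not real enterprise systems.)
sources = {
    "crm": "crm_extract.csv",
    "orders": "order_history.csv",
}

# Step 2: identify the relevant data within those sources, then cleanse
# it (or not) to the level the business needs demand.
crm = pd.read_csv(sources["crm"], usecols=["customer_id", "region"])
orders = pd.read_csv(sources["orders"], usecols=["customer_id", "order_total"])

orders = orders.dropna(subset=["customer_id"])               # drop unlinkable rows
orders["order_total"] = orders["order_total"].clip(lower=0)  # no negative sales

# Hand the analytics process a single prepared table.
prepared = crm.merge(orders, on="customer_id", how="inner")
```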

Analytics lessons from Amazon

For much of history, most commerce has been fairly reactive. Think about it. Consumers decided that they needed a new tchotchke and walked or drove to a store to actually buy it.

Sure, there have been variations on the theme. These have included mail orders, deliveries, cold-calling, door-knocking, and, relatively recently, online purchases. For the most part, though, a consumer made a conscious decision to part with his or her money. Adam Smith would be proud.

Business needs and performance expectations: Data management for analytics

In the last few days, I have heard the term “data lake” bandied about in various client conversations. As with all buzz-term simplifications, the concept of a “data lake” seems appealing, particularly when it is implied to mean “a framework enabling general data accessibility for enterprise information assets.” And of course, as with all simplifications, the data lake comes bundled with presumptions, mostly centered on the deployment platform (that is: Hadoop and HDFS) that somewhat ignore two key things: who is the intended audience for data accessibility, and what do they want to accomplish by having access to that data.

I’d suggest that the usage scenario is obviously centered on reporting and analytics; the processes that access the data used for transaction processing or operational systems would already be engineered to the application owners’ specifications. Rather, the data lake concept is about empowering analysts to bypass IT when there are no compelling reasons to endure the bottlenecks and bureaucracy that have become the IT division’s typical modus operandi.
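
As a minimal sketch of that bypass pattern, assuming a Hadoop/HDFS-style lake exposed as Parquet files (the path, columns, and question are hypothetical), an analyst might query the raw landed data directly with Spark instead of waiting for a warehouse load:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-adhoc").getOrCreate()

# Read the raw landed files straight from the lake -- no warehouse
# model, no IT ticket. (Path and columns are placeholders.)
clicks = spark.read.parquet("hdfs:///lake/raw/web_clicks/")

# An ad hoc question the warehouse was never modeled to answer.
daily = (
    clicks.where(F.col("page").startswith("/products"))
          .groupBy(F.to_date("event_time").alias("day"))
          .count()
          .orderBy("day")
)
daily.show()
```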

Hadoop is not Beetlejuice

In the 1988 film Beetlejuice, the title character, hilariously portrayed by Michael Keaton, is a bio-exorcist (a ghost capable of scaring the living) hired by a recently deceased couple in an attempt to scare off the new owners of their house. Beetlejuice is summoned by saying his name three times. (Beetlejuice. Beetlejuice. Beetlejuice.)

Nowadays it seems like every time big data is discussed, Hadoop comes up in the conversation eventually. Need a way to deal with all that unstructured data? Hadoop. Need a way to integrate big data into your other applications? Hadoop. Need a replacement for your clunky old data warehouse? Hadoop.

Hadoop. Hadoop. Hadoop. It’s like saying it three times summons a data exorcist, a supernatural solution to all your big data problems. Unfortunately, neither big data nor Hadoop is that simple. I decided, therefore, to conduct a seance with a few of my fellow Data Roundtablers and exorcise three common misconceptions about Hadoop.

Differentiating process, persistence, and publication: Data management for analytics

As part of two of our client engagements, we have been tasked with providing guidance on an analytics environment platform strategy. More concretely, the goal is to assess the systems that currently compose the “data warehouse environment” and identify the considerations for choosing the optimal platforms to support the objectives of each reporting/analytics system.

Interestingly, yet not surprisingly, the systems are designed to execute data warehousing tasks of varying complexity and to support a community of users ranging from those who load extracts into Excel to make simple bar charts to sophisticated analysts performing data discovery and predictive/prescriptive analytics. Yet the siloed approach to management, design, and development of these systems has led to some unintentional replication.

The challenge is to determine what that replication really is. In some cases there is obvious data replication: a data subset is extracted from a set of files and is then loaded into an auxiliary data mart. In some cases the replication is process replication: an input data source is scanned, parsed, and transformed for loading into one data environment, and later the same data is pulled from the data environment, scanned, parsed, and transformed prior to loading into a data system that lies further downstream.
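
A hedged sketch of that second pattern: below, the scan/parse/transform logic is factored into one shared step that feeds every target, instead of being rebuilt for each downstream load. The feed name, fields, and targets are invented for the example.

```python
import csv

def parse_and_transform(path):
    """Scan, parse, and standardize one raw feed. In the replicated
    pattern this logic gets rebuilt separately for each target system;
    factoring it out lets every downstream load share one result."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "customer_id": row["cust_no"].strip(),
                "amount": round(float(row["amt"]), 2),
            }

# Transform once...
records = list(parse_and_transform("daily_feed.csv"))  # hypothetical feed

# ...then feed every target from the same prepared records, rather than
# pulling the data back out and re-parsing it further downstream.
for target in ("warehouse", "auxiliary_mart"):
    print(f"loading {len(records)} records into {target}")
```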

Lineage, data quality and continuity: Keeping your data analytics healthy

The adoption of data analytics in organisations is widespread these days. Due to the lower costs of ownership and increased ease of deployment, there are realistically no barriers for any organisation wishing to extract more value from their data.

This of course presents a challenge, because the rate of data analytics adoption has not always been matched by equally sound data management practices for ensuring quality along the analytics information chain.

In the past, many organisations adopted the data warehouse model for enabling analytics, meaning that there was at least some kind of buffer between operational systems and the management reporting layer.

Many organisations choose to ignore the data warehouse approach and suck data straight from live operational systems in order to perform regular analytics or perhaps ad hoc analysis of specific issues.

This is where issues can creep in, because the temptation to analyse poor-quality data is ever present.
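
One lightweight defence, sketched below with invented field names and rules, is to run a basic quality gate over any data pulled straight from a live system before the analysis ever sees it:

```python
import pandas as pd

def quality_gate(df: pd.DataFrame) -> pd.DataFrame:
    """Basic checks for data pulled straight from a live system, run
    before any analysis touches it. The rules here are examples only."""
    issues = []
    if df["customer_id"].isna().any():
        issues.append("missing customer_id values")
    if (df["order_total"] < 0).any():
        issues.append("negative order totals")
    if df.duplicated(subset=["order_id"]).any():
        issues.append("duplicate order_id rows")
    if issues:
        raise ValueError("live feed failed quality gate: " + "; ".join(issues))
    return df

# Hypothetical live extract with all three problems baked in.
live = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": ["a", None, "b"],
    "order_total": [120.0, -5.0, 60.0],
})

try:
    quality_gate(live)
except ValueError as err:
    print(err)  # analysis is blocked until the feed is fixed
```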

The need for speed: A look at real-time analytics

You may feel like the world is moving faster than ever. If so, then you can take solace in two facts:

  • You're not alone in feeling this way.
  • You're right. It is.

Celebrating the 25th anniversary of the Web, The Economist ran a piece examining the increasingly rapid adoption of new technologies. From it:

It took only seven years from the first web pages in 1991 for the web to be used by a quarter of the American population. That compares with 46 years for electricity, 35 years for the phone and 26 years for television. The web, just 25 years old, is still at the start of its life.

Hadoop and big data management: How does it fit in the enterprise?

The other day, I was looking at an enterprise architecture diagram, and it actually showed a connection between the marketing database, the Hadoop server and the data warehouse. My response can be summed up in two ways. First, I was amazed! Second, I was very interested in how this customer uses Hadoop.
