Data management lessons from Google

In a previous post, I explored the data-management strategies that companies can learn from Amazon. In short, there's a great deal to admire about how Jeff Bezos et al. manage ungodly amounts of data.

Today, I'll turn my lens towards another big data heavyweight: Google. (You may have heard of it.)

A Truly Data-Centric Business Model

"Data is a tremendously important corporate asset."

How many times have you heard a CXO say something to that effect? I'd bet quite often. Most of the time, those words fall on deaf ears. That is, an organization's actions betray its credo. Put differently, there's a world of difference between talking the talk and walking the walk.

Not at Google. If anything, the company might be too good at the data game.

Self-service big data preparation in the age of Hadoop

In April, the free trial of SAS Data Loader for Hadoop became available globally. Now, you can take a test drive of our new technology designed to increase the speed and ease of managing data within Hadoop. The downloads might take a while (after all, this is big data), but I think you’ll be pleasantly surprised at how quickly you can manipulate data on Hadoop distributions (offered by our partners Hortonworks and Cloudera).

SAS Data Loader for Hadoop is the latest technology purpose-built and optimized for Hadoop as more of our customers adopt this low-cost, high-capacity way to manage massive amounts of data. Industry estimates suggest that Hadoop usage has increased by 60 percent in the last two years. For our customers, it's important to apply analytics and data management "from," "with," and "in" Hadoop. SAS technologies can read and write data on Hadoop (from), lift data in parallel into memory to run high-performance analytics (with), and run analytical processing inside the Hadoop cluster (in) for improved performance, reduced data movement and improved governance.
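The payoff of the "in" pattern — reduced data movement — can be sketched in miniature. This is purely an illustration in Python, not SAS code: a plain dict stands in for a cluster's data nodes, and the names are invented for the example.

```python
# Hypothetical illustration of two of the access patterns described above.
cluster = {  # node_id -> rows stored on that node
    "node1": [3, 5, 8],
    "node2": [2, 7],
}

# "from": read the data out of the cluster into the client, then process it.
pulled = [row for rows in cluster.values() for row in rows]
total_from = sum(pulled)  # every row crossed the wire

# "in": push the computation to each node; only small partial results move.
partials = [sum(rows) for rows in cluster.values()]  # runs "inside" each node
total_in = sum(partials)

assert total_from == total_in == 25
```

The answer is the same either way; the difference is how much data had to move to get it, which is exactly the governance and performance argument the post is making.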

In data governance’s service: data virtualization, part 1

Data governance and data virtualization can become powerful allies. Governance is to be understood here not as law, but as support and vision for business analytics applications. Our governance processes must become agile in the same way our business is transforming. Data virtualization, being a very versatile tool, can offer a fast track to gaining that flexibility.

Data virtualization is a pretty simple and flexible tool. At its core, you get a platform that connects to data sources and makes all the data found in them available downstream. Of course, that’s not all – you also get security, caching and auditing.

One could argue that this kind of functionality is common, and that every ETL or data access engine can do the same. That is not true. Let's discuss two of the many possible implementation patterns for data virtualization. The first style sits on top of data warehouses and other areas of raw data as a guard for data consumers. The second sits on the source data side and supports data unloading and preparation. Each has advantages that can put your data governance activities on steroids.
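The core idea — one platform that connects to sources and layers security, caching and auditing on top — can be sketched in a few lines. This is a minimal Python illustration of the concept, not any vendor's product; every name in it is hypothetical.

```python
# A toy data virtualization layer: registers data sources, serves queries
# through one interface, and adds caching and auditing along the way.
class VirtualizationLayer:
    def __init__(self):
        self.sources = {}    # source name -> callable that fetches rows
        self.cache = {}      # caching: avoid re-hitting slow sources
        self.audit_log = []  # auditing: who asked for what

    def register(self, name, fetch):
        self.sources[name] = fetch

    def query(self, name, user):
        self.audit_log.append((user, name))          # record every access
        if name not in self.cache:
            self.cache[name] = self.sources[name]()  # connect on demand
        return self.cache[name]

layer = VirtualizationLayer()
layer.register("warehouse.customers", lambda: [{"id": 1}, {"id": 2}])
rows = layer.query("warehouse.customers", user="analyst")

assert len(rows) == 2
assert layer.audit_log == [("analyst", "warehouse.customers")]
```

Consumers see one stable interface regardless of where the data actually lives — which is what lets the layer act as a guard in the first pattern, or as a preparation front end in the second.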

Filter, format and deliver: Managing data for analytics

In the last post, we talked about creating the requirements for data analytics and profiling the data prior to load.  Now, let’s consider how to filter, format and deliver that data to the analytics application.

  • Filter – the act of selecting the data of interest to be used in the analytic application.  For example, you may only want customers who have been more than 5 days late with their payment. Or, you may be looking for customers who never use a specific credit card.
  • Format – the act of conforming the data to be used in the analysis.  This may entail adding a default code or verbiage for null or blank columns, recoding specific columns based on the target source, or changing datatypes in the data.  For example, you can cast a specific date format to create consistency for the target analytic application.
  • Deliver – the act of physically providing the data for the analysis.  This could be a .csv file, an XML file, database table(s) or a physical load into the analytics application using a specific import utility or tool.
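The three steps above can be sketched end to end with nothing but the Python standard library. The 5-day threshold and the idea of defaulting null columns come from the examples in the text; the records themselves and every field name are made up for illustration.

```python
import csv, io
from datetime import datetime

records = [
    {"customer": "A", "days_late": "9",  "signup": "2015/03/01", "segment": None},
    {"customer": "B", "days_late": "2",  "signup": "2015/03/15", "segment": "gold"},
    {"customer": "C", "days_late": "12", "signup": "2015/04/02", "segment": None},
]

# Filter: keep only customers more than 5 days late with a payment.
selected = [r for r in records if int(r["days_late"]) > 5]

# Format: default a code for null columns, and cast dates to one consistent format.
for r in selected:
    r["segment"] = r["segment"] or "UNKNOWN"
    r["signup"] = datetime.strptime(r["signup"], "%Y/%m/%d").strftime("%Y-%m-%d")

# Deliver: write a .csv (here, to an in-memory buffer) for the analytics app.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer", "days_late", "signup", "segment"])
writer.writeheader()
writer.writerows(selected)
```

Swapping the delivery step for a database load or an application-specific import utility leaves the filter and format logic untouched, which is one reason to keep the three concerns separate.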

Junk drawers and data analytics

In the era of big data, we collect, prepare, manage, and analyze a lot of data that is supposed to provide us with a better picture of our customers, partners, products, and services. These vast data murals are impressive to behold, but in painting such a broad canvas, they may be impressive works of art yet sketchy splotches of science, beset by statistical and systematic errors as well as sampling bias. And when taken out of context, the analysis of all this data may reveal only superficial correlations, not deep insights.

You might be able to learn something about me, for example, by analyzing the contents of the junk drawer in my kitchen. In it you would find not only a few dozen paper clips and rubber bands of various sizes, several dollars' worth of loose change, and coupons, most expired, for local restaurants and shops, but also a wide assortment of other items. Among them would be frequent flyer cards for just about every major airline and loyalty cards for just about every major hotel chain and car rental agency, player cards for half the casinos on the Las Vegas strip, three wristwatches, battery chargers for half a dozen mobile phones, instruction manuals for numerous small electronics, boxes of business cards from my last five jobs, and maps of local hiking and biking trails.

Accurate data definitions: The keystone to trusted data analytics?

One area that often gets overlooked when building out a new data analytics solution is the importance of ensuring accurate and robust data definitions.

This is one of those issues that is difficult to detect because, unlike a data quality defect, there are no alarms or reports to indicate a fault; the user simply draws the wrong meaning from the analytics.

Let’s take a car dealership group as an example. A new IT manager decides to invest in an analytics solution and provides a dashboard to executives. The CEO wants to know how well the business is growing, so they look at customer growth, which shows a steady increase over time.

The analytics system receives regular feeds of data from over 250 dealerships across the country, but many use their own customer management systems. This is where the problems begin.
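A tiny, hypothetical illustration of how a definition mismatch corrupts the dashboard without tripping any alarm: suppose one dealership feed counts distinct buyers as "customers" while another counts transactions. All names and numbers below are invented.

```python
# Two dealership feeds that define "customer" differently.
feed_a = ["alice", "bob"]             # one record per distinct buyer
feed_b = ["alice", "alice", "carol"]  # one record per transaction

# The dashboard's naive roll-up: every record is a "customer".
naive_total = len(feed_a) + len(feed_b)      # reports 5 customers

# What the CEO actually wants: distinct people across all feeds.
true_total = len(set(feed_a) | set(feed_b))  # only 3 distinct people

assert naive_total == 5 and true_total == 3
```

Nothing here is a data quality defect — every record is valid — yet "customer growth" is inflated, which is exactly why the fault is so hard to detect downstream.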

Data preparation: Managing data for analytics

What data do you prepare for analysis?  Where does that data come from in the enterprise?  Hopefully, by answering these questions, we can understand what is required to supply data for an analytics process.

Data preparation is the act of cleansing (or not) the data required to meet the business needs specified in the analytic requirements. Much like any other requirements document, the sources of the data must be identified.  Then, you have to identify the required data within those sources.
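The flow can be sketched in a few lines of Python: name the sources, then pull and cleanse only the fields the analysis needs. The source names, fields and cleansing rules here are all hypothetical, purely to show the shape of the step.

```python
# Identified sources, each supplying rows for the analytic requirements.
sources = {
    "crm":     [{"id": 1, "email": "a@example.com "}, {"id": 2, "email": None}],
    "billing": [{"id": 3, "email": "C@EXAMPLE.COM"}],
}

def cleanse(row):
    # Trim, lowercase, and default missing values per the requirements.
    email = (row["email"] or "").strip().lower()
    return {"id": row["id"], "email": email or "missing"}

prepared = [cleanse(row) for feed in sources.values() for row in feed]

assert prepared[1]["email"] == "missing"
assert prepared[2]["email"] == "c@example.com"
```

The "(or not)" in the definition matters: the cleansing rules live in one small function, so the same pipeline can pass data through untouched when the requirements call for raw values.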

Analytics lessons from Amazon

For much of history, most commerce has been fairly reactive. Think about it. Consumers decided that they needed a new tchotchke and walked or drove to a store to actually buy it.

Sure, there have been variations on the theme. These have included mail orders, deliveries, cold-calling, door-knocking, and, relatively recently, online purchases. For the most part, though, a consumer made a conscious decision to part with his or her money. Adam Smith would be proud.

Business needs and performance expectations: Data management for analytics

In the last few days, I have heard the term “data lake” bandied about in various client conversations. As with all buzz-term simplifications, the concept of a “data lake” seems appealing, particularly when it is implied to mean “a framework enabling general data accessibility for enterprise information assets.” And of course, as with all simplifications, the data lake comes bundled with presumptions, mostly centered on the deployment platform (that is: Hadoop and HDFS) that somewhat ignore two key things: who is the intended audience for data accessibility, and what do they want to accomplish by having access to that data.

I’d suggest that the usage scenario is obviously centered on reporting and analytics; the processes accessing the data used for transaction processing or operational systems would already be engineered to the application owners’ specifications. Rather, the data lake concept is about empowering analysts to bypass IT when there are no compelling reasons to endure the bottlenecks and bureaucracy that have become the IT division’s typical modus operandi.

Hadoop is not Beetlejuice

In the 1988 film Beetlejuice, the title character, hilariously portrayed by Michael Keaton, is a bio-exorcist (a ghost capable of scaring the living) hired by a recently deceased couple in an attempt to scare off the new owners of their house. Beetlejuice is summoned by saying his name three times. (Beetlejuice. Beetlejuice. Beetlejuice.)

Nowadays it seems like every time big data is discussed, Hadoop comes up in the conversation eventually. Need a way to deal with all that unstructured data? Hadoop. Need a way to integrate big data into your other applications? Hadoop. Need a replacement for your clunky old data warehouse? Hadoop.

Hadoop. Hadoop. Hadoop. It’s as if saying it three times summons a data exorcist, a supernatural solution to all your big data problems. Unfortunately, neither big data nor Hadoop is that simple. I decided, therefore, to conduct a séance with a few of my fellow Data Roundtablers and exorcise three common misconceptions about Hadoop.
