Big data integration: The case against an "all-in" approach

I've spent a great deal of time in my consulting career railing against multiple systems of record, data silos and disparate versions of the truth. In the mid-1990s, I realized that Excel could only do so much. To quickly identify and ultimately ameliorate thorny data issues, I had to up [...]

Data quality on Hadoop: The easy way

Bigger doesn’t always mean better. And that’s often the case with big data. Your data quality (DQ) problem – no denial, please – often only magnifies when you get bigger data sets. Having more unstructured data adds another level of complexity. The need for data quality on Hadoop is shown by user [...]

Hadoop is not Beetlejuice

In the 1988 film Beetlejuice, the title character, hilariously portrayed by Michael Keaton, is a bio-exorcist (a ghost capable of scaring the living) hired by a recently deceased couple in an attempt to scare off the new owners of their house. Beetlejuice is summoned by saying his name three times. (Beetlejuice. Beetlejuice. Beetlejuice.) Nowadays [...]

Hadoop and big data management: How does it fit in the enterprise?

The other day, I was looking at an enterprise architecture diagram, and it actually showed a connection between the marketing database, the Hadoop server and the data warehouse. My response can be summed up in two ways. First, I was amazed! Second, I was very interested in how this customer uses [...]

EMC and SAS redefine big data analytics with the data lake

Adoption of Hadoop, a low-cost open source platform used for processing and storing massive amounts of data, has exploded by almost 60 percent in the last two years alone, according to Gartner. One primary use case for Hadoop is as a data lake – a vast store of raw, minimally processed data. But, in many ways, because [...]

Provisioning data for advanced analytics in Hadoop

The data lake is a great place to take a swim, but is the water clean? My colleague, Matthew Magne, compared big data to the Fire Swamp from The Princess Bride, and it can seem that foreboding. The questions we need to ask are: How was the data transformed and [...]

Using Hadoop: Emerging options for improved query performance

In my last two posts, we concluded two things. First, because of the need for broadcasting data across the internal network to enable the complete execution of a JOIN query in Hadoop, there is a potential for performance degradation for JOINs on top of files distributed using HDFS. Second, there are [...]

Using Hadoop: Query optimization

In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed [...]

Using Hadoop: Impacts of data organization on access latency

Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics. However, it is still not necessarily clear that Hadoop is always the optimal choice for traditional data warehousing for reporting and analysis, especially in its “out of the box” configuration. That is because Hadoop itself is not [...]
