Using Hadoop: Query optimization

In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed to be compared with all the records in a second data set. The need to look at the cross-product of the two data sets led to a data latency nightmare, requiring broadcasts of each data chunk to all the computing nodes and flooding the network.
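
To make that concrete, here is a minimal, self-contained sketch of the usual alternative, a reduce-side (repartition) join, written in plain Python to mimic what a MapReduce job does. The table names and fields are hypothetical; the point is that once records are grouped by the join key, only records sharing a key are ever compared, so neither data set has to be broadcast in full.

```python
# A toy reduce-side (repartition) join, mimicking the MapReduce pattern:
# the "map" step emits each record under its join key, the "shuffle" groups
# records by key, and the "reduce" step pairs up only co-keyed records.
# Data and field names here are hypothetical.
from collections import defaultdict
from itertools import product

customers = [("c1", "Alice"), ("c2", "Bob")]           # (cust_id, name)
orders = [("c1", 250.0), ("c1", 75.0), ("c2", 40.0)]   # (cust_id, amount)

# Map + shuffle: bucket each record by its join key, tagged by source table.
buckets = defaultdict(lambda: {"customers": [], "orders": []})
for cust_id, name in customers:
    buckets[cust_id]["customers"].append(name)
for cust_id, amount in orders:
    buckets[cust_id]["orders"].append(amount)

# Reduce: the cross-product is confined to each key's bucket, so no node
# ever needs a full copy of either data set.
for cust_id, group in buckets.items():
    for name, amount in product(group["customers"], group["orders"]):
        print(cust_id, name, amount)
```

On a real cluster the shuffle is what moves data between nodes, so a badly skewed join key can still concentrate work on one reducer; the sketch only illustrates why keyed grouping beats an all-to-all broadcast.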


Showing the ugly face of bad data: Part 1

Financial institutions are mired in large pools of historic data across multiple lines of business and systems. However, much of the recent data is produced externally and is isolated from decision-making and operational banking processes. The limitations of existing banking systems, combined with inward-looking and confined data practices, make it tough to make sense of both structured and unstructured data as it arrives.

Simply accessing large amounts of data across devices and platforms is challenging, but the story doesn’t end there. A highly connected and online consumer world has changed the once laid-back decision-making approach that used to be the forte of branch banking.


What skills will be required to make sense of big data?

For enterprises across the globe, I'm hard-pressed to think of a more game-changing development than the advent of relational databases. One could write a very long book about how they have vastly improved organizations' ability to collect, store, process, and act on customer, employee, and product information—at least certain types of information. Yes, I'm talking about the structured kind.

Will the big data revolution lead to similar advancements?



Big wishes for data management

In the movie Big, a 12-year-old boy, after being told he is too short for a carnival ride and embarrassed in front of an older girl he was trying to impress, puts a coin into an antique arcade fortune teller machine called Zoltar Speaks, wishes to be big, and awakes the next morning transformed into a 30-year-old man.

Traditional data management is making a wish to integrate big data efforts into existing processes and programs in order to transform the organization into a 21st century data-driven enterprise. Since Zoltar Speaks doesn’t really grant wishes, especially big data wishes, let’s briefly look at a few aspects of big data integration.


Using Hadoop: Impacts of data organization on access latency

Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics. However, it is not necessarily the optimal choice for traditional data warehousing, reporting, and analysis, especially in its “out of the box” configuration. That is because Hadoop itself is not a database, even though some data organization methods are adapted to and firmly ingrained within its distributed architecture.

The first is the distributed file organization itself – the Hadoop Distributed File System, or HDFS. And while the data organization provided by HDFS is intended to provide linear scalability for capturing large data volumes, aspects of HDFS will impact the performance of reporting and analytical applications.

Hadoop is deployed across a collection of computing and data nodes, and your data files are correspondingly distributed across the different data nodes. Because one of the foundational aspects of Hadoop is fault tolerance, there is an expectation of potential component failure. To mitigate this risk, HDFS not only distributes your file, it also replicates the different chunks across different nodes so that if one node fails, the data is still accessible on another. In fact, by default the data is stored redundantly three times.
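
As a rough illustration of that behavior, here is a toy Python sketch (not HDFS code) of splitting a file into fixed-size blocks and placing each block's replicas on distinct nodes. The node names, the random placement, and the 128 MB block size are assumptions for illustration; real HDFS placement is rack-aware, and older releases defaulted to 64 MB blocks.

```python
# Toy model of HDFS-style block replication: split a file into fixed-size
# blocks and store each block on several distinct nodes, so losing any one
# node leaves every block readable elsewhere. Placement here is random for
# simplicity; real HDFS uses a rack-aware placement policy.
import random

BLOCK_SIZE = 128 * 1024 * 1024   # bytes; older Hadoop releases defaulted to 64 MB
REPLICATION = 3                  # HDFS default replication factor
NODES = [f"node-{i}" for i in range(8)]  # hypothetical data nodes

def place_blocks(file_size_bytes):
    """Map each block index to the distinct nodes holding its replicas."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return {b: random.sample(NODES, REPLICATION) for b in range(num_blocks)}

# A 1 GB file becomes 8 blocks, each stored on 3 of the 8 nodes,
# so the cluster holds 3x the raw data volume.
for block, replicas in place_blocks(1024**3).items():
    print(f"block {block}: {replicas}")
```

The threefold redundancy is what buys the fault tolerance, but it also means every write costs three times the storage and network traffic, which is part of why HDFS behaves differently from a database under reporting workloads.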


What's the future of data? In a word, more.

When asked what his movement wanted around a century ago, the iconic American labor leader Samuel Gompers famously gave a one-word answer: "More."

This annoyed his opponents at the negotiating table and many in the business community. He was not demanding a specific wage increase or fighting for a distinct cause like child-welfare laws or workweek maximums. He just wanted more—and all that that entailed.



The three myths of ongoing data quality: Financial gains based on data quality

If you are looking for a way to fund your data quality objectives, consider looking in the closets of the organization. For example, look for issues that cost the company money and that could have been avoided with better availability, quality, or reliability of the data.


Crowdsourcing data improvement: Part 3

In this blog series, I am exploring whether it’s wise to crowdsource data improvement, and whether the power of the crowd can enable organizations to incorporate better enterprise data quality practices.

In Part 1, I provided a high-level definition of crowdsourcing and explained that while it can be applied to a wide range of projects and activities, applying crowdsourcing to data improvement involves three aspects: type of data, kind of crowd, and form of improvement. Part 2 focused on type of data and kind of crowd. Part 3 focuses on form of improvement, which can be divided into two categories: data quality and data enrichment.


Social media numbers: The data quality challenge

One of my biggest problems with social media is its emphasis on simple numbers.

That might seem like an odd statement coming from a guy who bleeds data and maintains an active social media presence.

Let me explain this apparent contradiction.



The three myths of data quality: Put it in production and leave it alone

Once in a while, people run into an issue with the data that doesn’t really need to be fixed right away to ensure the success of a specific project. So the data issues are put into production and forgotten. Everyone always says, “We will go back and correct this later.” But that never happens. At least not with anyone I know. If you have had the luxury of going back and making corrections after something is in production, please let me know so I can change my attitude on this issue!

The assumption is that if it is in production and nothing broke, then all is OK! In the first blog of this three-part series, I suggested that a data quality initiative should check for the completeness, accuracy, and integrity of the data. These, to me, are the most important data quality metrics to collect and monitor.
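
To make those three metrics tangible, here is a minimal sketch of the kind of checks they imply, written in plain Python over a hypothetical orders table: completeness (no missing values), accuracy (values within a plausible range), and integrity (foreign keys that resolve). The field names and rules are assumptions for illustration, not anyone's production checks.

```python
# Hypothetical records for illustration; the three checks are the point.
customers = {"c1", "c2"}  # known customer IDs
orders = [
    {"order_id": "o1", "cust_id": "c1", "amount": 250.0},
    {"order_id": "o2", "cust_id": "c9", "amount": -75.0},  # broken key, bad amount
    {"order_id": "o3", "cust_id": "c2", "amount": None},   # missing amount
]

# Completeness: every field is populated.
complete = sum(all(v is not None for v in o.values()) for o in orders)
# Accuracy: amounts fall in a plausible range (here, non-negative).
accurate = sum(o["amount"] is not None and o["amount"] >= 0 for o in orders)
# Integrity: every order points at a customer that actually exists.
intact = sum(o["cust_id"] in customers for o in orders)

n = len(orders)
print(f"completeness {complete}/{n}, accuracy {accurate}/{n}, integrity {intact}/{n}")
```

Even checks this simple, run regularly against production data, make the "put it in production and leave it alone" myth visible: the scores drift when the deferred fixes never happen.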
