Data integration methods

There are many ways to approach data integration, including:

  1. Extract, transform and load (ETL) – moves and transforms data (with some redundancy) from a source to a target. While ETL can be implemented (somewhat) in real time, it is usually executed at intervals (every 15 minutes, 30 minutes, 1 hour, 4 hours, 8 hours or perhaps just once a day). This type of ETL is used to integrate multiple sources into operational data stores or a data warehouse (see the sketch after this list).
  2. Logical data integration – requires software that will connect to multiple data stores, along with rules on which attributes to get from which data stores (there is nothing free in the world of data). I call this semantic metadata, and it essentially says, “Hey – I know how to get your data.” Bear in mind that if you create one place to get data and connect logically with software, the query or report will only be as fast as your slowest connection!
  3. Data integration as an event-based service – an event can fire when a field in a database changes (think database triggers!) and publish the change for consumption by another process. While this sounds really good, it is not so easy to implement. If your corporation has multiple software products that perform the same function, integrating them through an event-based service will be very difficult, and if the consumers of the data require A LOT of attributes, this may not be the way to go.
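
For concreteness, here is a minimal ETL sketch in Python. Everything in it is hypothetical (the in-memory SQLite databases, the customer and dim_customer tables, the cleanup rule); a real job would run from a scheduler at one of the intervals above and add logging, error handling and incremental extraction.

```python
import sqlite3

# Seed a toy operational source so the sketch runs end to end
# (hypothetical table and data).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "  ada lovelace "), (2, "GRACE HOPPER")])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT)")

# Extract: pull the rows from the operational source.
rows = src.execute("SELECT id, name FROM customer").fetchall()

# Transform: clean the data in flight (trim and title-case names).
cleaned = [(cid, name.strip().title()) for cid, name in rows]

# Load: write the (somewhat redundant) copy into the target store.
tgt.execute("DELETE FROM dim_customer")  # full refresh on each scheduled run
tgt.executemany("INSERT INTO dim_customer VALUES (?, ?)", cleaned)
tgt.commit()
print(tgt.execute("SELECT * FROM dim_customer").fetchall())
```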

Big data quality – Part 1

Big data poses some interesting challenges for disciplines such as data integration and data governance, but this blog series addresses some of the most common questions big data raises about data quality.

Does data quality matter less in larger data sets?

Some believe huge data volumes make individual data quality issues insignificant, thanks to the law of large numbers. This view posits that individual data quality issues will make up only a tiny part of the mass of big data – but it assumes data quality issues do not scale with data volume. My favorite analogy for this involves Kool-Aid. Adding one spoonful of the drink mix to a glass of water creates a tasty beverage (my favorite drink as a kid). Adding one spoonful to a gallon of water, however, will only make colorful water that still tastes like water. People who believe data quality matters less in larger data sets imagine big data pouring in gallons at a time while data quality issues trickle in only spoonfuls at a time. Don’t drink this Kool-Aid.
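
A back-of-the-envelope sketch makes that assumption explicit (the record and error counts below are made up for illustration): dilution only happens if the number of issues stays fixed while the volume grows.

```python
# Hypothetical starting point: 1,000,000 records, 10,000 of them bad (1%).
# The "law of large numbers" argument only dilutes the problem if the bad
# records do NOT grow along with the data.
records, bad = 1_000_000, 10_000

for growth in (1, 10, 100):
    diluted = bad / (records * growth)            # issues stay fixed
    scaled = (bad * growth) / (records * growth)  # issues scale with volume
    print(f"{growth:>3}x volume: fixed issues {diluted:.3%}, "
          f"scaling issues {scaled:.3%}")
```

In practice, more data usually means more sources and more chances for error, so the "scaling issues" column is the realistic one – which is the author's point.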

Overcoming the IT-Business Divide in an era of big data, Part 3

In the first part of this series, I recapped several key recent trends as they relate to the IT-business divide. In short, thanks to the rise of cloud computing, big data and BYOD, IT is (generally speaking) less equipped than ever to act as the gatekeeper of enterprise data. In part two, I described how big data means that IT needs to act as a facilitator.

So where does that leave us – and how do we put a nail in the divide's coffin once and for all?

There's no simple way to bridge the divide in an era of big data, but let me suggest three not-so-dangerous ideas.

Apache YARN to become “the operating system for your data”

It’s been an amazing journey with Hadoop.

As we discussed in an earlier blog, Hadoop is forming the basis of a comprehensive enterprise data platform that can power an ecosystem of analytic applications to uncover rich insights on large sets of data.

With YARN (Yet Another Resource Negotiator) as its architectural center, this data platform now enables multi-workload data processing across an array of methods, from batch through interactive to real time. And it’s supported by the key capabilities enterprise data platforms need – governance, security and operations.

Overcoming the IT-Business Divide in an era of big data, Part 2

In the first part of this series, I described the new challenges that IT departments face today. Collectively, they make it unreasonable for IT to act as the traditional gatekeeper of enterprise information. That's not to say, though, that IT should just sit back and ignore the very data that employees use to make business decisions.

Far from it.

In this post, I'll describe how, more than ever, IT (or some equivalent entity) needs to act as a technology and data facilitator.

Overcoming the IT-Business Divide in an era of big data, Part 1

The IT-Business Divide is lamentably alive and well in many organizations.

You know what I'm talking about: that exhausting and inimical internal bickering between IT and everyone else about who's responsible for what. I would wager that thousands of intelligent articles, blog posts, studies and white papers have been written about bridging the traditional IT-business divide. (Thomas Redman penned a particularly good post for HBR a few years back.)

In the first of this three-part series, I'll examine this well-trodden issue against the backdrop of recent trends, particularly the rise of big data.

By way of background, I've seen the traditional IT-business divide first-hand on dozens of IT projects throughout my consulting career. Today, in many mature companies, that divide resembles a growing chasm.

Big data integration – A good starting point for data governance?

In the UK, technology trends move a little more slowly than they do for our US counterparts. It was about five years ago that I first met a data leader at a conference on this side of the pond who was actively engaged in large-scale big data projects.

This wasn’t a presenter or big-name draw at the event. My "Big Data Scoop" came during a break-out coffee-and-danish session – fertile ground for uncovering new stories for Data Quality Pro.

Data integration – Job skills required for success

Data integration, on any project, can be very complex – and it involves a tremendous amount of detail. The person I would pick for my data integration team would have the following skills and characteristics:

  1. Has an enterprise perspective of data integration, data quality and extraction, transformation and load (ETL):
    1. Understands data quality, data profiling and ETL tools.
  2. Understands the need for enterprise data management:
    1. Including data modeling for the enterprise and each data integration project.
  3. Understands database performance for load and retrieval of data:
    1. This should include indexing, partitioning and views.
    2. Reporting environment implementation.
    3. Propagating data to other systems (if necessary).
  4. Possesses the ability to write highly optimized SQL and/or consult with developers to achieve results:
    1. Once in a while we have to “roll our sleeves up” and help out.
    2. Code reviews and testing will be required.
  5. Participates in gathering and prioritizing the requirements:
    1. Collaborates in writing the scope, requirements and detailed technical document.
  6. Is a MASTER at spreadsheets:
    1. Mapping from one or more sources to a target requires documentation on the process, the quality of the data, anticipated values and any other technical notes required for the data integration project (see the sketch after this list).
  7. Works well with others in a complex, intense development environment.
  8. Possesses leadership skills:
    1. Working with and delegating to other team members.
    2. Reporting progress to project managers and upper management (when required).
      1. PowerPoint is a MUST.
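
As one illustration of the mapping documentation in item 6, here is a minimal Python sketch. All of the table, column and rule names are hypothetical; the point is that a mapping kept as structured data (rather than only as a spreadsheet) can also be checked automatically.

```python
# A hypothetical source-to-target mapping of the kind that usually lives in a
# spreadsheet: source column, target column, transformation rule and notes.
MAPPING = [
    {"source": "crm.customer.cust_nm", "target": "dw.dim_customer.customer_name",
     "rule": "trim, title-case", "notes": "blank in ~2% of source rows"},
    {"source": "crm.customer.cust_dob", "target": "dw.dim_customer.birth_date",
     "rule": "parse MM/DD/YYYY", "notes": "reject dates in the future"},
    {"source": "crm.customer.cust_status", "target": "dw.dim_customer.status",
     "rule": "", "notes": ""},  # incomplete entry, flagged below
]

def lint_mapping(mapping):
    """Flag entries that are missing a transformation rule or notes."""
    for row in mapping:
        missing = [k for k in ("source", "target", "rule", "notes")
                   if not row.get(k)]
        if missing:
            print(f"{row.get('source', '?')}: missing {', '.join(missing)}")

lint_mapping(MAPPING)  # prints: crm.customer.cust_status: missing rule, notes
```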

Sounds like a SUPER person, doesn’t it? Actually, data integration requires a super-person who understands the business needs and can translate that understanding into technical documents. Not an easy job to fill, and it may take multiple people to cover these tasks. If you find someone with these qualities, hang onto them. They are worth their weight in gold.

SAS is a leader in the Gartner Magic Quadrant for Data Integration Tools for the fifth consecutive year.

Big data model convergence: Combining metadata and data virtualization as a collaboration tool – Part 2

I am currently cycling through a schema-on-read data modeling process on a specific task for one of my clients. I have been given a data set and asked to consider how it can best be analyzed using a graph-based data management system. My process is to load the data, examine whether I have created the right graph representation, execute a few queries, and then revise the model. I think I am almost done, except that as I continue to manipulate the model for analysis, I keep noticing one more thing about the data that needs tweaking before the real analysis can begin. A minimal sketch of one turn of this loop appears below.
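
For illustration only, here is one load-query-revise iteration in Python. The records, node kinds and relationship names are hypothetical, and networkx stands in for whatever graph-based system the project actually uses.

```python
import networkx as nx

# Load: build an initial graph straight from raw records (hypothetical data).
records = [("alice", "order-1"), ("bob", "order-2"), ("alice", "order-3")]
g = nx.Graph()
g.add_edges_from(records)

# Query: check whether this representation answers the question at hand.
print("orders per customer:", {n: g.degree(n) for n in ("alice", "bob")})

# Revise: the flat, untyped edges weren't enough, so re-model with typed
# nodes and edges, then repeat the examine-query-revise cycle.
g2 = nx.DiGraph()
for customer, order in records:
    g2.add_node(customer, kind="customer")
    g2.add_node(order, kind="order")
    g2.add_edge(customer, order, rel="placed")
print("typed edges:", list(g2.edges(data=True)))
```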
