In my previous two posts I introduced the need for data quality in big data initiatives and discussed both fit-for-purpose data quality and stretching the limits of data quality for big data. In this final post on data quality and big data, I'll address the following:
- Architecture design can make or break your success
- Text, entity relationships & semantics play a critical role
Architecture design can make or break your success
It should go without saying that solid architecture is the basis for success. With big data, that goes double, because volume, variety and velocity can exacerbate even minor problems. Plan appropriately: you may need multiple architecture patterns to support big data, and those patterns may differ in order to accommodate scale, diversity and so on. Here are some initial considerations:
Consider the architecture type: Since several different architecture patterns will be necessary to support a comprehensive big data infrastructure, the data quality approach must be designed to support multiple use cases. For example, with the advancement of grid, in-database and in-memory computing, some analytic use cases can now be processed against the entire data set. In other cases, a “stream it, score it, store it” approach applies analytics up front to identify the relevant data for downstream analytics. Event-style processing will also be used in big data scenarios. The data quality approach must be designed to fit the architecture; you can't always assume an ETL-style approach to analytics.
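To make the “stream it, score it, store it” pattern concrete, here is a minimal sketch. The scoring function, the 0.5 threshold and the record shapes are purely illustrative assumptions, not a prescribed implementation:

```python
def score(record):
    # Toy relevance score: the fraction of fields that are populated.
    # A real pipeline would apply an analytic model here instead.
    values = list(record.values())
    return sum(1 for v in values if v not in (None, "")) / len(values)

def stream_score_store(stream, store, threshold=0.5):
    # Score each incoming record up front and persist only those
    # relevant enough for downstream analytics.
    for record in stream:
        if score(record) >= threshold:
            store.append(record)
    return store

incoming = [
    {"id": 1, "name": "Acme", "region": "EMEA"},   # fully populated
    {"id": 2, "name": "", "region": None},         # mostly empty
]
kept = stream_score_store(iter(incoming), [])
```

The point of the sketch is that filtering happens as data streams in, so only the scored-and-kept subset ever lands in storage.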
Design your strategy with flexibility in mind: New sources of big data will keep coming at you, so your infrastructure needs to be flexible. Today it may be LinkedIn, then Facebook and Twitter… who knows what it will be tomorrow? With machine data, we have already witnessed an explosion in the number of devices that carry sensors, and it's highly likely that the frequency and scope of machine-related data will change over time: the number of devices transmitting information, the frequency of emission, and so on.
Consider file system support: From a relational database perspective, we are accustomed to file systems that support create, read, update and delete (CRUD) operations. Some of the distributed file systems commonly used in big data scenarios, such as Hadoop, may not support updates. For example, the Hadoop Distributed File System (HDFS) supports append-only writes rather than transactional updates to existing data. This needs to be factored into the data quality architecture, because data quality efforts that depend on update-in-place simply won't work. One approach is to apply data cleansing to data as it flows into or out of file systems that don't support updates.
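A rough sketch of that cleanse-on-ingest idea follows; the record layout and the cleansing rules (trim and upper-case) are illustrative assumptions:

```python
def cleanse(record):
    # Trim whitespace and upper-case string fields before writing.
    # Once appended to an append-only store, a record cannot be
    # corrected in place, so cleansing must happen on the way in.
    return {k: (v.strip().upper() if isinstance(v, str) else v)
            for k, v in record.items()}

def append_cleansed(records, sink):
    # 'sink' stands in for an append-only target such as an HDFS file:
    # new records can be added, existing ones are never rewritten.
    for record in records:
        sink.append(cleanse(record))
    return sink

stored = append_cleansed([{"id": 7, "city": "  cary "}], [])
```

The same transformation could equally be applied on the way out, when data flows from the append-only store into an analytic process.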
Text, entity relationships & semantics play a critical role
It’s probably not a surprise that text enters into the big data conversation, since unstructured data is commonly associated with big data initiatives. In addition to text, being able to effectively link big data sources to transactional entities such as customer, product, etc., is critical if you want to derive the greatest value from your big data. And finally, metadata and semantics can help you effectively manage big data.
Focus your efforts on entity relationships: Especially in work relating to social media or other forms of contextual or interaction data, a key component of the data governance, data quality or MDM process is the ability to correlate the big data source to the transactional or enterprise data. This allows you to relate specific customer feedback from social sources to internal customer data that is tied to product or service purchases. This approach can strengthen marketing analysis efforts, since the analysis is correlated at the individual customer level rather than across broad segments of transactional and interaction data. And remember that it doesn’t stop at the primary entity or customer… it is also necessary to entity-match the customer’s friends to determine whether interaction with the customer drives business with those friends.
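One simple way to sketch that correlation step is fuzzy name matching between a social-media display name and customer master records. This uses the standard-library `difflib.SequenceMatcher`; the 0.8 threshold and the record shapes are assumptions for illustration, and production matching would use far richer attributes than a name alone:

```python
from difflib import SequenceMatcher

def match_entity(social_name, customers, threshold=0.8):
    # Link a social-media display name to the closest customer master
    # record; return None when no record clears the threshold.
    best, best_score = None, 0.0
    for customer in customers:
        score = SequenceMatcher(None, social_name.lower(),
                                customer["name"].lower()).ratio()
        if score > best_score:
            best, best_score = customer, score
    return best if best_score >= threshold else None

masters = [{"id": 101, "name": "Jonathan Smith"},
           {"id": 102, "name": "Maria Garcia"}]
hit = match_entity("jonathan smith", masters)
```

The same matcher could then be applied to the customer's friends, linking each friend's social identity back to the master data where a record exists.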
Your big data efforts should consider text: In many cases, big data involves some form of textual or unstructured data. The quality issues that plague user-entered text apply to big data initiatives as well. The following examples represent typical text-related data quality challenges that should be extended into big data environments:
- Identifying misspelled words, or managing synonym lists to group similar items like “lvm”, “left voice mail”, “left a message”, etc., which may otherwise skew analysis.
- Handling instant-message abbreviations in social media data, and addressing the different sets of terminology used by various industries and professions.
- Leveraging content categorization to ensure that the textual data is relevant. For example, filtering out noise around a company name: differentiating SAS Institute, SAS shoes, SAS the airline, etc. We have been involved in projects where only 38% of the data referring to a portion of a bank’s name was actually relevant, because the same term appeared in unrelated text strings.
- Utilizing contextual intelligence to discern meaning. For example, differentiating between a person and the name of a hotel: “Paris Hilton walks into the Paris Hilton”. This should include the ability to factor the distinction into count or summary analysis wherever it is necessary to delineate between the person and the place.
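The synonym-grouping challenge in the first bullet above can be handled with a simple normalization map; the specific mappings and canonical form here are illustrative assumptions:

```python
# Hypothetical synonym map for grouping equivalent call-note phrases.
SYNONYMS = {
    "lvm": "left voice mail",
    "left a message": "left voice mail",
    "left voicemail": "left voice mail",
}

def normalize(note):
    # Lower-case and trim, then collapse known synonyms to a single
    # canonical form so that counts group correctly in analysis.
    text = note.strip().lower()
    return SYNONYMS.get(text, text)

grouped = [normalize(n) for n in ["LVM", "Left a message", "called back"]]
```

Without this step, “LVM” and “left a message” would be counted as distinct activities even though they describe the same event.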
Metadata and semantics are key: Leverage tools that help you determine the semantics and meaning of textual data; combining that understanding with your quality efforts provides wide-reaching benefits. This is especially important when it comes to opaque big data: video, audio, images, etc. Although it is certainly possible to convert audio to text and act on the text, in many cases these opaque objects are best managed through their associated metadata.
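A minimal sketch of managing opaque objects through their metadata follows; the catalog fields, paths and tags are hypothetical, chosen only to illustrate the idea:

```python
# Each opaque asset (audio, video, image) is represented by a metadata
# record; search and governance operate on the metadata, not the bytes.
catalog = [
    {"path": "calls/rec_0001.wav", "media_type": "audio",
     "duration_sec": 182, "tags": ["support call", "billing"]},
    {"path": "ads/spot_12.mp4", "media_type": "video",
     "duration_sec": 30, "tags": ["marketing"]},
]

def find_by_tag(assets, tag):
    # Retrieve assets by a metadata tag, without ever touching the
    # underlying media content.
    return [a for a in assets if tag in a.get("tags", [])]

billing_calls = find_by_tag(catalog, "billing")
```

Data quality work then shifts to the metadata itself: ensuring tags, durations and media types are complete and consistent across the catalog.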