In my last set of posts I have been noodling on the meaning of “data quality” in the context of massive amounts of unstructured data, and in particular on what “data quality” means for social media “big data.” I believe this is a more complex question than can be addressed in one set of blog entries, but we can look at one aspect of it: the use of content to set context (i.e., using hashtags that themselves carry meaning).
A lot revolves around understanding how these contextual bits and pieces are organized. I originally motivated the discussion with an overloaded term, one that means different things in two different communities. Interestingly, though, since tweeters don’t always limit themselves to a single hashtag, anyone really interested in extracting knowledge from the aggregation and consolidation of Twitter feeds can organize relationships and hierarchies among and across the posts using the different tags. As an example, the #MDM tag may appear frequently with a #DataQuality tag, which (of course) would usually place the content within the master data management context (and not the mobile device management one). Don’t worry about the absence of standards, either. The standard is effectively crowd-sourced: those hashtags that are meaningful and relevant to the community will be used and reused, while the ones that don’t fit the bill will eventually fall out of use.
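To make the co-occurrence idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the posts are made up for the example, the function names (`extract_tags`, `cooccurrence`, `companions`) are hypothetical, and hashtags are matched with a simple regular expression rather than any Twitter API. It counts how often each tag is used and how often pairs of tags appear together, so that the companions of an ambiguous tag such as #MDM hint at which context a post belongs to.

```python
import re
from collections import Counter
from itertools import combinations

# Simplifying assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_tags(post):
    """Return the set of hashtags in a post, lowercased and without '#'."""
    return {tag.lower() for tag in HASHTAG.findall(post)}


def cooccurrence(posts):
    """Count single-tag usage and unordered tag pairs across posts."""
    tag_counts = Counter()
    pair_counts = Counter()
    for post in posts:
        tags = sorted(extract_tags(post))  # sorted so each pair has one canonical order
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))
    return tag_counts, pair_counts


def companions(tag, pair_counts):
    """Return the tags seen alongside `tag`, most frequent first."""
    tag = tag.lstrip("#").lower()
    found = Counter()
    for (a, b), n in pair_counts.items():
        if a == tag:
            found[b] += n
        elif b == tag:
            found[a] += n
    return found.most_common()


# Made-up posts, for illustration only.
posts = [
    "Matching customer records is hard #MDM #DataQuality",
    "Survivorship rules for golden records #MDM #DataQuality #DataGovernance",
    "Locking down tablets in the field #MDM #BYOD",
    "New whitepaper on profiling #DataQuality",
]

tag_counts, pair_counts = cooccurrence(posts)
print(tag_counts.most_common())          # crude proxy for community adoption
print(companions("#MDM", pair_counts))   # context cues for the ambiguous tag
```

In this toy data, #DataQuality is the most frequent companion of #MDM, which points to the master data management sense, while the post tagged #BYOD suggests the mobile device sense. The raw usage counts are also a rough stand-in for the crowd-sourced standard: tags that stop being used simply drop toward zero over time.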
So we have a direction for organizing our contextual cues and determining their relevance. Together, these ideas help address at least one part of the big data quality challenge: imposing meaning in a consistent way by analyzing the self-organization that emerges around what amounts to a very fluid underlying knowledge base. Perhaps we will explore this in greater detail in an upcoming set of posts.