When I first heard the word metadata, I didn't get it. This was in 1999. I was new at SAS and writing an article about data warehousing. The definition, "data about data," meant nothing to me.
(Plus, I've always had trouble with the meta prefix in general. Go read the meta entry in Wikipedia. It includes terms like metamemory, metaemotion and meta-answer ... and that's not even getting into metaphysics.)
Now it's nine years later, and the social Web is turning us all into metadata experts. Anyone who's ever tagged a blog post, created a Flickr set or categorized a wiki page has had their hands in the metadata of the Internet. If you've gone a step further and created a repeatable system for the tagging or categorizing of your posts and photos, you've been introduced to data governance (albeit on a very small scale).
So, suddenly I understand these terms - metadata and data governance - in a way I never have before. And I can see now how data governance can become a tricky business. On this blog alone, I've created more than 100 unique tags for the content in my posts, but the words I've chosen to use as tags and the consistency with which I use them are erratic, to say the least. Then, I forget to tell fellow contributors about the system for tagging their posts on this blog, and their entries go up without any tags at all … or with categories instead of tags.
You can see a list of this blog's most frequently used tags over there in the right-hand column. But since they're not used consistently, you might not really see all the posts about, say, unstructured data when you click on the unstructured data link. This is bad data governance.
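For the technically inclined, here is a minimal sketch in Python of what a very small tag-governance rule might look like. Everything in it is an assumption made up for illustration: the synonym table, the tag names and the function names are hypothetical, and none of it reflects how this blog's software actually works.

```python
# Hypothetical synonym table: maps variant spellings to one canonical tag.
# A real tagging system would need a much longer, agreed-upon list.
CANONICAL_TAGS = {
    "unstructured data": "unstructured data",
    "unstructured-data": "unstructured data",
    "text data": "unstructured data",
    "text mining": "text mining",
    "textmining": "text mining",
}


def normalize_tag(raw_tag):
    """Return the canonical form of a tag, or the cleaned-up tag if unknown."""
    # Lowercase, trim, and collapse runs of whitespace into single spaces.
    cleaned = " ".join(raw_tag.strip().lower().split())
    return CANONICAL_TAGS.get(cleaned, cleaned)


def normalize_tags(raw_tags):
    """Normalize a list of tags, dropping duplicates but keeping their order."""
    result = []
    for tag in raw_tags:
        canonical = normalize_tag(tag)
        if canonical not in result:
            result.append(canonical)
    return result


print(normalize_tags(["Unstructured-Data", " text data ", "TextMining"]))
```

The point is only that a shared list of agreed-upon tags, applied the same way every time, is what keeps a tag link honest. Someone still has to maintain that list, which is the governance part.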
Now that text mining is in the news again, I'm wondering about the benefits of text mining over tagging. I understand that it’s not necessarily an either-or choice. But if you had a good text mining program, wouldn’t it automatically do everything your tagging system does – and more? Wouldn’t it reduce or at least transform your need for data governance?
I’m just a writer here, remember. Not a technology expert. I’ve never touched an actual data warehouse, never read a star schema. But sometimes the novice perspective is useful. Or at least I hope it is.
I was talking with Gaurav Verma last week about the Teragram acquisition and his thoughts on what the marriage of SAS and Teragram technologies could mean for the future of BI.
We got to chatting about these e-mail alert systems we all use. Google has them. Teragram has them. Lots of BI products even include alerts right on your dashboard that notify you about certain topics. But Google alerts are set up for specific words that you’ve identified. Text mining technologies could be used to alert you about specific concepts, ideas around concepts, or even connections between concepts that you’ve never considered. These alerts could notify you of developments in external data sources like news sites, blogs, social media sites, etc. Or they could be set up to alert you of new information in internal documents such as insurance claims, customer complaints, warranty claims and more.
Then, Gaurav and I started talking about document sharing within an enterprise. One of the systems we all use for sharing documents is SharePoint, and we all complain about it. It’s not intuitive. It organizes information into buckets that don’t make sense. None of the individual SharePoint sites are connected. The list of complaints goes on.
What if, instead of using SharePoint, we each had our own public document stream? (I'm thinking of my Flickr photo stream again and how it's organized. Social media informs my ideas on all this stuff.) Maybe we tag the documents in our document stream (which would require data governance at the corporate level). Or – if our text mining capabilities are good enough – maybe we don’t need to tag them. Maybe the important key words and concepts and relationships between those concepts will be recognized by the system. So we don’t even have to create folders or lists to separate documents in our streams; they just organize themselves into the appropriate areas automatically, based on the content within them.
So then, if my document is related somehow to a document you added to your stream yesterday, those documents will be automatically associated in the system, and anyone who’s interested in the common concept within our documents would be notified that these two new documents are in the system and should be examined together.
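To make the idea of automatic association a little more concrete, here is a rough Python sketch. It is a toy under stated assumptions, not real text mining: it simply measures how many words two documents share (Jaccard overlap), whereas a true text mining product would work with concepts rather than exact words. The file names, the stopword list and the 0.3 threshold are all hypothetical.

```python
import re

# Hypothetical list of common words to ignore when comparing documents.
STOPWORDS = {"the", "a", "an", "of", "for", "in", "and", "to", "on", "is", "was", "were"}


def key_terms(text):
    """Return the set of lowercase words in the text, minus stopwords and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def similarity(text_a, text_b):
    """Jaccard overlap of key terms: shared terms divided by all terms, from 0.0 to 1.0."""
    a, b = key_terms(text_a), key_terms(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_pairs(stream, threshold=0.3):
    """Return pairs of document names whose similarity meets the threshold."""
    names = sorted(stream)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(stream[first], stream[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# A made-up document stream: document name -> document text.
stream = {
    "mine.txt": "Warranty claims rose in Central America",
    "yours.txt": "Central America warranty claims report",
    "other.txt": "Holiday party schedule",
}
print(related_pairs(stream))
```

In this made-up stream, the two warranty documents would be paired up and the party schedule left alone. The pairs that come back are the ones the system could then flag to anyone who follows that topic.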
I keep saying documents, but really this corporate stream of information could also include reports, spreadsheets, photos, video, etc. It could be everything.
Then, if someone goes in to search the public document stream for, say, fourth quarter 2007 sales totals for Central America, they would get that number, not a list of documents. But they would also get a huge list of additional information they might find interesting. The results are smarter than standard search results, because you're not just getting a list of documents that contain the exact terms you've entered. You're getting results based on related concepts that show up in the various documents.
This is knowledge management. This is text mining. This is true enterprise search.