The more data you can apply to a business problem, the better its potential solutions. While there’s no shortage of data available to your enterprise today, it’s often difficult to know what data you have and how it can be used. This is why you should never overlook the important role metadata plays in the data ecosystem. The ability of disparate data to connect and combine (even when it’s co-located in the same data lake or cloud repository) is largely dependent on the metadata the data shares. Data tagging is only one aspect of that – but it's a very important one.
Many people are familiar with tagging outside the context of enterprise data management. Blog posts (like this one), online articles, videos, photos, podcasts and social media are all examples of unstructured or semi-structured data that rely heavily on tagging to connect them to related material. Tags also play a big role in keyword searches and search engine optimization. We’ve all experienced the disappointment of deliberately misleading tags: we click through to supposedly related content, only to discover it was click-bait.
Within the context of enterprise data management, data tagging provides many benefits. For example, data tagging can:
- Help determine how much data preparation should be performed on new data sources.
- Enable efficient data discoverability – so when data is needed later for specific business purposes, it's quick and easy to locate the most applicable data.
- Improve big data quality, especially by making unstructured and semi-structured big data more usable.
- Help identify sensitive personal data so access can be properly managed and governed.
- Help flag and filter ethically dubious or otherwise questionable data before any of it is used in decision making or artificial intelligence solutions.
Let’s examine four data tagging best practices.
Standardize the tags
Tags are a subset of the essential metadata that makes up a business glossary. The list of business data terms in a business glossary provides an authoritative vocabulary that promotes a common understanding among stakeholders in an organization. Without standard tag values, tagging often produces homonyms (i.e., the same tag used with different meanings) and synonyms (i.e., multiple tags for the same concept). These can lead to inappropriate data relationships and inefficient searches for data about a particular subject.
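To make the synonym problem concrete, here is a minimal sketch in Python of mapping free-form tags onto a standard vocabulary. The vocabulary, the tag names and the `normalize_tags` function are all hypothetical, invented for illustration; a real glossary would live in your metadata management tool, not in code.

```python
# Hypothetical controlled vocabulary: each standard tag lists accepted synonyms.
STANDARD_TAGS = {
    "customer": {"client", "cust", "customers"},
    "revenue": {"sales", "income"},
    "pii": {"personal data", "personally identifiable information"},
}

# Reverse lookup: every known spelling points to its standard tag.
_LOOKUP = {}
for standard, synonyms in STANDARD_TAGS.items():
    _LOOKUP[standard] = standard
    for synonym in synonyms:
        _LOOKUP[synonym] = standard


def normalize_tags(raw_tags):
    """Return (standard_tags, unknown_tags) for a list of raw tag strings."""
    standard, unknown = set(), set()
    for raw in raw_tags:
        # Ignore case and collapse extra whitespace before looking up the tag.
        key = " ".join(raw.strip().lower().split())
        if key in _LOOKUP:
            standard.add(_LOOKUP[key])
        else:
            unknown.add(key)
    return sorted(standard), sorted(unknown)


print(normalize_tags(["Client", " Sales ", "customers", "Churn"]))
```

Tags that come back as unknown are candidates for review: either they map to an existing term or they point to a term the glossary is missing. Note that a lookup like this only handles synonyms. Homonyms cannot be resolved mechanically, because the same string carries two meanings; they usually need someone to agree on a definition.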
Use all applicable tags
As with a lot of metadata management tasks, you can try to skate along – doing the bare minimum by only applying one or two tags. But since most data can be used for multiple purposes, it’s important to use all applicable tags. Doing so may reveal unexpected uses. It could also identify the business group most interested in tagging a particular source – which might make that group a logical candidate to be its data steward as well.
Don’t over-tag
I know this recommendation sounds like it contradicts the previous best practice. But tags lose their significance if you get carried away with applying them. Frequency distribution analysis of tag values, both individually and in various combinations, can help you prune extraneous tags. This analysis may also help further standardize the tags by revealing that a frequently used combination of tags should be available as an additional standard tag value. A single combined tag is sometimes a better fit than assigning multiple individual tags.
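As a rough illustration of that frequency analysis, the Python sketch below counts how often each tag and each pair of tags appears across a set of sources. The source names, the tags and the thresholds are made up for the example, and the cut-offs for "too rare" or "too common" are arbitrary assumptions you would tune for your own catalog.

```python
from collections import Counter
from itertools import combinations


def tag_frequencies(tagged_sources):
    """Count each tag and each pair of tags across all sources."""
    single, pairs = Counter(), Counter()
    for tags in tagged_sources.values():
        unique = sorted(set(tags))  # sort so each pair is always counted in the same order
        single.update(unique)
        pairs.update(combinations(unique, 2))
    return single, pairs


def pruning_candidates(single, n_sources, min_count=2, max_share=0.9):
    """Tags used on very few sources, and tags used on nearly all of them."""
    rare = sorted(t for t, c in single.items() if c < min_count)
    ubiquitous = sorted(t for t, c in single.items() if c / n_sources >= max_share)
    return rare, ubiquitous


# Hypothetical catalog: source name -> tags applied to it.
sources = {
    "crm_export": ["customer", "pii", "data"],
    "web_logs": ["clickstream", "data"],
    "billing": ["customer", "pii", "revenue", "data"],
    "support_tickets": ["customer", "pii", "data"],
}

single, pairs = tag_frequencies(sources)
print(pruning_candidates(single, len(sources)))
print(pairs.most_common(3))
```

In this toy catalog, a tag that appears on every source distinguishes nothing and is a candidate for pruning, while a pair that always travels together hints at a combined standard tag. A rarely used tag is not automatically wrong; it may simply describe a niche source, so treat the output as a list to review, not a list to delete.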
Re-evaluate tags over time
It’s important to remember that business terminology and business context rarely stay static. While many tags remain applicable for a long time, don’t assume they will always stay that way. Plus, if tagging doesn't consistently deliver the benefits described above, then investigate why. You may find that you need to re-standardize and re-apply your existing tags.
Tag. You’re it!
Have you had any experience with data tagging? Share your experiences and recommended practices with the community by leaving a comment below.