Do you have a big data project that lacks resources? I do! But as fast as the technology changes, how can we expect to find people with the skill sets required for all of these new tools, like:
- Apache Pig
- Apache Hive
- MapReduce
- Podium
- Apache Sqoop
- And so many others...
We can't expect people to be experts in each of these tools while also staying abreast of technologies that change every other week. And because the technologies do change frequently, and new players keep showing up in the sandbox, the gap in skilled resources continues to widen. Since no one can be proficient at everything, it's worth noting that the same profiling, data quality, ETL, reporting and analytics tools that we know and love are quickly becoming part of our big data world.
Some companies create a dependency on hiring more people to manage their big data projects because they feel the need to use the newest, shiniest tool on the market. I prefer to look first at the products a company already uses every day before reaching for new products on the horizon.
New tool – or existing tool?
If a company's existing tools and skill sets don't work for big data, then it's time to consider newer technologies. The same is true when a newer technology meets a specific requirement that you can't meet any other way. What we don't want to do is create 40,000+ lines of custom code to maintain, as we did before these wonderful tool sets and data assets existed.
It's clear that we face the same issues on our big data platforms that we've always faced on our data warehousing and operational platforms. Data quality, data profiling and monitoring are still required for big data. Governance and movement of data from one place to another are still required. Understanding usage based on performance is still required. Metadata surrounding all of this is still required. We still move the data and use the data, regardless of its size.
Maybe, instead of always looking for more resources with the skills to use new tools, we should consider using the tools, resources and vendors we already have experience with. For example, if we already use a data quality/data profiling tool, then why not consider that tool for big data, too?
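To make that concrete, here's a minimal sketch of the kind of checks a familiar profiling tool already performs, run directly on a big data platform. It assumes PySpark is available and reads a hypothetical Parquet extract at /data/landing/customers; the path and data are illustrative only, not from any real project.

```python
# A minimal profiling sketch on a big data platform using PySpark.
# Assumption: a running Spark environment and a hypothetical Parquet
# extract at /data/landing/customers (illustrative path, not real).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

df = spark.read.parquet("/data/landing/customers")

# The same basic checks a traditional profiling tool runs:
# row counts, null counts per column, and distinct counts per column.
print("rows:", df.count())

null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c + "_nulls") for c in df.columns]
)
null_counts.show()

for c in df.columns:
    print(c, "distinct:", df.select(c).distinct().count())
```

The point isn't the code itself; it's that these are the same checks your existing profiling tool and team already understand, so the skills transfer even when the platform changes.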
Knowing that the tools and corresponding skill sets will change, I still prefer to have people on my team who understand technology and data requirements in general, and who are always willing to learn something new. They'll figure out the rest as we implement new platforms.
Generally speaking, most vendors now offer a tool that works on big data platforms. So before you shop around for newer tools, consider using what you have. Ask yourself the following:
- Do we need to pay more for tool expansion onto the big data platform?
- Does this vendor offer a big data solution? And if so, how much more do we need to pay?
- Do we already have resources that know how to use this tool?
- How much training will they require?
- How many resources will be required to use these new tools?
- How much will these resources cost?
- Is the new tool compatible with the governance and data management practices already in place at my organization?
- If not, what has to change?
My preference is always to work with the tools and resources a client already has before suggesting new tools or more people. Maybe an extension of an existing licensing agreement would work, or perhaps the client could partner with the vendor on a project. These approaches can save organizations time and money as they venture into new platforms and new ways of thinking about data.