More and more organizations are considering maturing, scalable computing environments like Hadoop as part of their enterprise data management, processing and analytics infrastructure. But there's a significant difference between the evaluation phase of technology adoption and the subsequent production phase, and that difference is apparent in how organizations are embracing Hadoop. Clearly, many people are kicking the tires at this point – installing the software stack and checking out the various tutorials and example applications. But only a more selective crowd of organizations has moved past the evaluation stage.
It's interesting to reflect on the preliminary results of a recent survey our analyst firm is conducting. One question asked about the types of capabilities organizations were trying to improve through the use of Hadoop. The top three responses were “predictive analytics,” “data lake,” and “data warehouse augmentation.” While predictive analytics was the most popular choice, the data lake and data warehouse augmentation are much more “grounded” uses for the Hadoop adopter, for two reasons:
- Both the data lake and data warehouse augmentation are relatively simple applications that take advantage of the Hadoop ecosystem without requiring a high level of Hadoop sophistication.
- Implementing them provides rapid time to value, addressing acute resource needs with a low-cost alternative to costly data warehouse appliances.
Let’s look at the data lake concept in particular. According to TechTarget, a data lake is “… a storage repository that holds a vast amount of raw data in its native format until it is needed.” In Hadoop, data files are stored as blocks distributed (and replicated) across a cluster of data nodes using the Hadoop Distributed File System, or HDFS. Files can be moved into HDFS from local storage and then accessed using other Hadoop tools.
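As a rough sketch of that first step, the snippet below uses Hadoop's Java FileSystem API to copy a local file into an HDFS directory in its native format. The NameNode address, file names and directory layout are placeholders for illustration, not a prescription.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToDataLake {
    public static void main(String[] args) throws Exception {
        // Cluster settings normally come from core-site.xml / hdfs-site.xml on the
        // classpath; here the NameNode is set explicitly (placeholder host and port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path localFile = new Path("/data/exports/orders_2015.csv"); // placeholder local file
            Path lakeDir   = new Path("/datalake/raw/orders/");         // placeholder landing directory

            // Create the landing directory if needed, then copy the file as-is into HDFS.
            fs.mkdirs(lakeDir);
            fs.copyFromLocalFile(localFile, lakeDir);
        }
    }
}
```

Once files land in HDFS this way, the rest of the Hadoop ecosystem can work on them in place, without another copy step.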
When the data in an HDFS file is structured, it can be accessed and queried using SQL-on-Hadoop tools such as Hive, Impala or Stinger (among others). On the other hand, if there is no imposed structure on the file, an SQL-style tool won't help – but you can still use search tools (like Solr), or write custom applications to read, process and store the data.
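For the structured case, a minimal sketch of what querying over those HDFS files can look like is a plain JDBC connection to HiveServer2. The endpoint, credentials and the orders_raw table below are hypothetical, and it assumes the Hive JDBC driver jar is on the classpath and that an external Hive table has already been declared over the files in the landing directory.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryDataLake {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database and credentials are placeholders.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop_user", "");
             Statement stmt = conn.createStatement()) {

            // Ordinary SQL against a (hypothetical) table defined over structured HDFS files.
            ResultSet rs = stmt.executeQuery(
                "SELECT customer_id, SUM(order_total) AS total_spend " +
                "FROM orders_raw GROUP BY customer_id LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```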
All this is well and good, and the standard tutorials demonstrate straightforward methods for both manipulating files and accessing the data. The challenge is this: what happens once you've worked through the inventory of training materials and want to start dumping all your data into your data lake?
Yes, in the most general sense, that's what a data lake is – a place to “dump” your data (in its native format) until it's needed. And if you don't know when the data will be needed, you probably won't know how it will be needed, or what needs to be done with it. Simply put: while the motivation behind the data lake is sound (leverage low-cost storage and keep everything), its philosophy defers good management decisions until the time of use (keep everything until you need it).
The problem emerges as you try to scale. It's one thing when data lake contents are presumed to be manageable (a small number of massive-volume files). Scaling becomes much more complex when the lake turns into the general repository for files of every size. Yet this is what organizations are doing – scanning all their computing systems, identifying data assets and moving them to the data lake. The most significant issue is that this ungoverned inventory will become unmanageable as more data assets are dumped into the lake.
In my next post, we'll examine two key ways to reduce the complexity of data lake management while more effectively modernizing the data infrastructure.