I can still remember the emergence of Java in the 1990s as a portable, near-universal programming language. Some observers believed it signalled the eventual demise of established languages such as C, Fortran and COBOL. Current surveys, however, and there are many, confirm that plenty of people continue to use these “legacy” languages.
I mention this because it may be a helpful parallel in the analytics context, as organisations embrace newer languages. At this stage, I should explain that I’m neither a data scientist nor a developer (though I confess my first job was as a PL/I programmer), so these views are firmly those of an enterprise architect. I also concede that scripting to deliver analytics can be very different from traditional application development.
As an example, I recently met with an organisation that remarked on the abundance of graduate talent with exposure to newer languages such as Python and R, and it is now seriously considering standardising on a single programming language. It also has a large existing SAS team, and you have to wonder what is to become of the decades of knowledge and experience they possess, and the man-years of SAS code running the business today. I’m reminded of the observation often attributed to Voltaire: “Don’t let the perfect be the enemy of the good.” For anyone considering standardisation, I offer the following questions:
- Can a single programming language satisfy every requirement?
- If you have teams with existing skills and experience in proprietary languages, does it make economic sense to arbitrarily junk that investment in favour of the latest open-source language?
- Should you really be encouraging coding anyway, if there is a GUI (graphical user interface) available?
- Rather than dwell on the attractions of a language and its IDE (Integrated Development Environment), isn’t it more important to give your analysts access to the right data and capabilities, and let them use whatever methods suit them?
Putting aside whether it makes sense to standardise on a single language, there’s also a bigger picture that many organisations have yet to appreciate as they develop their data science capabilities. It is relatively easy to equip your teams with capable individual tooling, but the result may be quiet chaos: empowered analysts will explore and process data to their hearts’ content, but in an environment that is diverse, complex and inefficient, and in the long term that is not sustainable.
Just as the data warehouse evolved to satisfy the need for centralised and governed data, I believe there’s an emerging requirement to deliver governance and managed capabilities for the analytics estate. At this point, data scientists will be tempted to roll their eyes, suspecting this is a play for a locked-down, IT-managed environment, but that is not the case.
To make best use of an analytics estate, I think several things are needed:
- Convenient and accelerated access to common data, to make life easier for your analysts. Ideally it should be delivered with high performance and availability. I’m thinking here of core data (tables and extracts) and derived work (such as Analytical Base Tables), which may contain images, text documents and videos as well as traditionally coded data. Today these are typically duplicated, with each analyst keeping their own versions and copies, or else stored formally in an RDBMS or informally on shared storage.
- Data management capabilities: cleansing, standardisation and the ability to share useful transforms and snippets across the analyst community (see the first sketch after this list).
- Flexible and centralised engine support for a range of scripting languages, with Python, R and SAS the obvious candidates.
- A recognition that analytical models are assets and should be treated as such: held in a repository, with governance covering monitoring, reporting and publishing.
- A framework for publishing models into the operational context, covering batch, online (including REST) and real-time execution (see the second sketch after this list).
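To make the idea of shared transforms concrete, here is a minimal sketch of the sort of cleansing function that might sit in a common package for the analyst community. It assumes pandas, and the function name, column name and rules are purely illustrative, not taken from any particular product:

```python
import pandas as pd

def standardise_customer_names(df: pd.DataFrame, column: str = "customer_name") -> pd.DataFrame:
    """Trim whitespace, collapse repeated spaces and title-case a name column.

    A transform like this would live in a shared package, so every analyst
    applies the same cleansing rules rather than re-inventing them locally.
    """
    out = df.copy()
    out[column] = (
        out[column]
        .astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.title()
    )
    return out

# Illustrative usage
raw = pd.DataFrame({"customer_name": ["  ada   LOVELACE ", "alan turing"]})
print(standardise_customer_names(raw))
# customer_name column becomes "Ada Lovelace" and "Alan Turing"
```

The point is not the cleansing logic itself, but that it is written once, governed centrally and reused everywhere.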
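And to make the publishing idea concrete, here is a minimal sketch of wrapping a model as a REST scoring endpoint, assuming Flask and a pickled scikit-learn-style model. The file name, route and feature names are hypothetical placeholders rather than a reference to any specific platform:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artefact: a real platform would fetch the model and its
# metadata from the governed model repository rather than local disk.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

# Illustrative feature names; in practice these would come from the
# model's metadata in the repository.
FEATURES = ["tenure_months", "monthly_spend", "support_calls"]

@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON body such as:
    # {"tenure_months": 12, "monthly_spend": 40.0, "support_calls": 2}
    payload = request.get_json()
    row = [[payload[name] for name in FEATURES]]
    return jsonify({"prediction": float(model.predict(row)[0])})

if __name__ == "__main__":
    app.run(port=8080)
```

Batch and real-time publishing would wrap the same model artefact in different execution harnesses; the repository is what keeps all three routes consistent.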
The list above may be incomplete, but it’s a good start. This, therefore, is my recipe for an Enterprise Analytics Platform. Ideally it would be delivered on unified infrastructure, but realistically it’s more likely to be an ecosystem of a handful of components that together deliver these capabilities. You also need the ability to deploy those capabilities on-premises, in private, public or hybrid clouds, and in containers.
Is this pie in the sky or do you have one of these already? Answers on a postcard, please!