Data science is hot. Being a data scientist has been described as one of the world’s sexiest careers – though that may say more about the people coining the description than about data science – and data scientists are both in high demand and increasingly well paid. It is, in fact, highly lucrative to be able to use statistical techniques to manipulate data and generate insights. If you can also present and explain those insights to others, you can name your own price.
A new energy to data management
Increasingly, however, those in the data business are noticing another term emerging. Data engineering is mentioned more and more often as an essential adjunct to data science. It may not be quite as sexy, but it is often given equal weight in terms of importance. But what does it mean?
There is no single agreed definition of data engineering, perhaps because it is such a new discipline, but it seems clear that it is about data management. Data engineers, in other words, deliver data to data scientists in a form that allows them to generate insights. They are, for example, responsible for data collection, storage and cleaning. They are also, in many organisations, the people chiefly responsible for assets such as data lakes and analytics platforms.
Defining the role of a data engineer
A data engineer’s role might include:
Building massive reservoirs for big data
A good platform now offers far more data processing power than has ever been possible before. This means that analytics processes can draw on more data – but only if it is available and stored in a form in which it can be used.
Managing data and developing data set processes
Data lakes need ongoing engineering and management to prevent them from becoming swamps, which can happen extremely quickly. Reliable data set processes and other engineering work ensure that data is available and in a usable form. It is much harder to drain a swamp than to fill a reservoir!
Managing and monitoring data quality
Data engineers are the people primarily responsible for data quality in the organisation, at both the macro (data lake) and micro (individual data item) levels. Their role is to ensure that data is ready for use – which means clean, in a suitable format, and reliable. A good platform will assist in this, because it will provide reliable and efficient access to data, and also allow continuous cleaning and other management.
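To make "clean, in a suitable format, and reliable" a little more concrete, here is a minimal sketch of the kind of check a data engineer might run on incoming records. It is an illustration only: the field names (`id`, `timestamp`, `value`), the `QualityReport` class and the `check_quality` function are all hypothetical, and real checks would depend on the organisation's own data and platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QualityReport:
    """Counts of problems found in one batch of records (hypothetical structure)."""
    total: int
    missing: int      # records with an empty required field
    bad_format: int   # records whose timestamp is not ISO 8601
    duplicates: int   # records repeating an id already seen

    @property
    def clean(self) -> int:
        return self.total - self.missing - self.bad_format - self.duplicates


def check_quality(records, required=("id", "timestamp", "value")):
    """Classify each record by the first problem found, checking in a fixed order."""
    seen_ids = set()
    missing = bad_format = duplicates = 0
    for record in records:
        # Completeness: every required field must be present and non-empty.
        if any(record.get(field) in (None, "") for field in required):
            missing += 1
            continue
        # Format: the timestamp must parse as an ISO 8601 date-time.
        try:
            datetime.fromisoformat(record["timestamp"])
        except (TypeError, ValueError):
            bad_format += 1
            continue
        # Reliability: the same id should not appear twice.
        if record["id"] in seen_ids:
            duplicates += 1
            continue
        seen_ids.add(record["id"])
    return QualityReport(len(records), missing, bad_format, duplicates)


# Made-up sample data, assumed purely for illustration.
records = [
    {"id": 1, "timestamp": "2024-01-01T00:00:00", "value": 3.2},
    {"id": 2, "timestamp": "not a date", "value": 1.0},
    {"id": 1, "timestamp": "2024-01-02T00:00:00", "value": 4.1},
    {"id": 3, "timestamp": "2024-01-03T00:00:00", "value": None},
]
report = check_quality(records)
print(report, "clean:", report.clean)
```

Run continuously on each new batch, a report like this gives an early warning when a source starts to degrade, before the problem reaches the data scientists.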
Developing, constructing, testing and maintaining highly scalable architectures
The best platforms offer scalability as standard, and can expand as and when required. Data engineers ensure that the platform is maintained in such a way that it operates efficiently and effectively at whatever level is required.
Researching data acquisition and finding new uses for existing data
As more and more sensors and devices are connected to the Internet of Things, there is more and more data available. New sources often bring new opportunities, but existing data can also be combined in new ways to generate new insights, if it is made available.
Ongoing integration of new and improved data management tools
A good platform will not just allow, but actively provide, new data management capabilities, ensuring that they are integrated and work well with existing tools and capabilities. These capabilities should also work across a variety of languages, for maximum flexibility and to bring different technologies together. Data engineers have a key role in maintaining this open architecture.
Actively managing risks to data, and ensuring that there is a recovery plan
Good data engineers are active risk managers, looking ahead and putting contingency plans in place to mitigate any disaster. They also ensure that recovery plans are clear and easy to kick-start if the worst should happen.
Working closely with data architects, data scientists and those responsible for modelling
The best way to ensure that data is fit for purpose is to work closely with those who use it, which means data scientists and modellers. A good platform will provide an environment for close collaboration, but the best engineers will actively seek out feedback on whether the data meets users’ needs.
This is only the beginning
This list of tasks, which is by no means exhaustive, goes some way towards explaining why data engineering is increasingly seen as essential. Some estimates suggest that a data science project needs at least one data engineer to each data scientist, and I think this may be an underestimate. Expect to see much more about this group in future, as data engineering rapidly becomes the new hot topic.