In my last blog post, I talked about the importance of establishing the right team for data science projects. Here, I’m going to talk about some of the barriers that can prevent successful adoption of data science. You can read my whole "data science in the wild" blog series here.
Lack of strategy definition
Data scientists tend to work on loosely defined business problems that involve messy data combined with computationally intensive machine learning algorithms. It is, therefore, important to establish a data strategy.
Enterprise data strategy is at the heart of production analytics. Before beginning a data science project, you need to understand your deployment strategy – or deployment won’t happen. For example, will you be scoring data via an application in real time, or will you be batch scoring via an API? Will your analytics be at the edge, or is the focus on presenting the results via an interactive dashboard? These questions matter.
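To make the real-time versus batch question concrete, here is a minimal sketch in which one scoring function serves both deployment styles. The model, its weights and the field names (`age`, `prior_claims`) are all made up for illustration; a real deployment would load a trained model instead.

```python
from typing import Dict, Iterable, List

# Hypothetical model: these weights and the bias are invented for this sketch.
WEIGHTS = {"age": 0.02, "prior_claims": 0.5}
BIAS = -1.0

def score_record(record: Dict[str, float]) -> float:
    """Score one record; this is the call a real-time application would make."""
    return BIAS + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)

def score_batch(records: Iterable[Dict[str, float]]) -> List[float]:
    """Score many records in one go, for example in a scheduled overnight job."""
    return [score_record(record) for record in records]

single = score_record({"age": 50, "prior_claims": 2})
batch = score_batch([{"age": 50, "prior_claims": 2}, {"age": 30, "prior_claims": 0}])
```

Keeping the scoring logic in one place means the choice between real-time and batch becomes a question of how the function is called, not of how the model is written.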
Practical and technical problems
Technical debt can also be a barrier to successful adoption of analytics, and it can be a catalyst for adopting a purely open source strategy. However, it is important to consider scalability at the outset of any project if you want to see a return on investment.
When considering a purely open source project, there are additional technical and practical considerations for project success. For example, do enough of the team program in the same language? How will you manage library versions? How will nonprogrammers contribute to the project? How will you scale your deployment? How will you manage model decay?
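On the model decay question, one simple option is to compare the error on recently scored data with the error measured when the model was first validated. The sketch below is only one way to do this; the function names and the `tolerance` value of 1.25 are my own assumptions, not a standard.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def has_decayed(baseline_mae, recent_actual, recent_predicted, tolerance=1.25):
    """Flag decay when recent error exceeds the baseline error by the tolerance factor."""
    recent_mae = mean_absolute_error(recent_actual, recent_predicted)
    return recent_mae > tolerance * baseline_mae

# Hypothetical numbers: the model had an error of 2.0 at validation time.
still_fine = has_decayed(2.0, [10, 12, 14], [11, 12, 13])
needs_review = has_decayed(2.0, [10, 12, 14], [15, 18, 20])
```

A check like this only works if actual outcomes are collected after deployment, which is itself something to plan for at the outset.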
In practice, I’ve seen that blending the agility of open source with the scalability of proprietary software often yields the best results. A modern, centralised analytics platform should allow teams to work natively in many languages whilst also enabling nonprogrammers to contribute.
It is also worth noting that regulatory requirements, such as GDPR, are more prominent than ever. It is important to consider where data will be stored, for how long, and whether it contains personally identifiable information. This leads into a much broader conversation around ethics and governance in data science, which deserves a whole blog post in itself.
Do you really need a model?
Before asking what kinds of models you want to use, it is worth asking if you need a model at all. The goal of data science is not simply building statistical models, but rather gathering, understanding and gaining value from large and messy data sets. In many cases, beginning with just descriptive statistics or visualisations can add a lot of value to a project.
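As a small illustration of how far descriptive statistics can go, the sketch below summarises a made-up series of daily call volumes using only Python's standard library. The data and the `describe` helper are hypothetical.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "iqr": q3 - q1,
    }

# Hypothetical data: daily call volumes with one suspicious spike.
call_volumes = [102, 98, 110, 95, 101, 99, 430, 104, 97, 100]
summary = describe(call_volumes)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Here the mean sits well above the median, which points straight at the outlier. That is a useful finding before any model has been built.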
What kind of model do you need?
Once you get into predictive analytics, it is prudent to consider the types of models you will use. For example, you may want to consider whether you will prioritise deterministic or nondeterministic models. Deterministic models, such as linear regression, are generally fitted by finding a global optimum, so the fitted model is the best possible fit for that data under the chosen criterion. This means that if you were to rebuild your model on the exact same data set, you would get the same model weights.
Nondeterministic models, such as neural networks, may better learn complex relationships in data. However, these types of models will generally be more computationally intensive to fit. Where model weights are fitted via iterative optimisation methods, often from a random starting point, the model may not converge on a global optimum. This means that if you were to rebuild your model on the exact same data set, you may not get the same model weights.
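The difference can be seen in a toy example. The sketch below fits the same made-up data twice: once with closed-form least squares, and once with gradient descent on a two-weight model `y = a * b * x` started from a random point. The product model is my own stand-in for a neural network, chosen because it is tiny. In this toy case both random starts reach an equally good fit, yet they end up with different weights.

```python
import random

X = [0.0, 0.5, 1.0, 1.5, 2.0]
Y = [3.0 * x for x in X]  # made-up data with a true slope of 3

def fit_closed_form(xs, ys):
    """Ordinary least squares for one feature: a single closed-form answer."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_product_model(xs, ys, seed, steps=2000, lr=0.01):
    """Fit y = a * b * x by gradient descent from a random starting point."""
    rng = random.Random(seed)
    a, b = rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the product a * b.
        grad = 2 * sum((a * b * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Chain rule: d(loss)/da = grad * b and d(loss)/db = grad * a.
        a, b = a - lr * grad * b, b - lr * grad * a
    return a, b

print(fit_closed_form(X, Y))  # identical on every run
a0, b0 = fit_product_model(X, Y, seed=0)
a1, b1 = fit_product_model(X, Y, seed=1)
print(a0, b0, a0 * b0)
print(a1, b1, a1 * b1)
```

Fixing the random seed makes the second approach repeatable, which is worth doing whenever a model has to be audited or rebuilt.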
The goal of predictive analytics is to produce a model which will generalise well to new data. As models increase in complexity, they fit the training data set better and appear more accurate, but may fail to generalise well to new data. This delicate balancing act between overfitting the data with a complex model and underfitting it with too simple a model is known as the bias-variance trade-off.
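Here is a rough sketch of that balancing act, using k-nearest-neighbour regression on simulated data. Both the model and the data are my own choices for illustration: a small `k` gives a very flexible model and a large `k` gives a very simple one, so comparing training and test error across values of `k` shows the pattern described above.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical noisy data: y = sin(x) plus Gaussian noise."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of k-nearest-neighbour predictions on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train, test = make_data(40, rng), make_data(40, rng)

# Small k = complex, wiggly model; large k = simple, smooth model.
results = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 5, 40)}
for k, (train_error, test_error) in results.items():
    print(f"k={k:>2}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With `k=1` the model reproduces every training point, so its training error is zero even though it has partly memorised noise. With `k=40` it predicts the overall average everywhere. The test error, not the training error, is the number to watch.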
Picking a model strategy is, indeed, a complete minefield. We haven’t even touched on other major considerations like missing data, dimensionality, standardisation and transformation, feature selection, feature engineering and many, many other points.
Fundamentally, it is important to note that a more complex model will not always yield the best results, and there is no silver bullet for tackling the beast of messy data.
An approach for success
Broadly speaking, when picking a model you should aim to balance the level of accuracy you need against explainability and computational cost. In practice, it is a good idea to try many different models. Typically, I start by fitting a simple baseline model and increase model complexity from there.
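The baseline-first habit can be sketched in a few lines. In this example, the baseline always predicts the training mean, and the next step up is a straight-line fit. The data is simulated and the helper names are hypothetical; the point is only the workflow of measuring each step up in complexity against the one before it on held-out data.

```python
import random

rng = random.Random(0)

# Hypothetical data: a roughly linear relationship with noise.
data = []
for _ in range(100):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))
train, holdout = data[:70], data[70:]

def fit_mean(rows):
    """Baseline: always predict the average target seen in training."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """One step up in complexity: least-squares straight line."""
    mean_x = sum(x for x, _ in rows) / len(rows)
    mean_y = sum(y for _, y in rows) / len(rows)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: mean_y + slope * (x - mean_x)

def mse(model, rows):
    """Mean squared error of a model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

baseline_error = mse(fit_mean(train), holdout)
line_error = mse(fit_line(train), holdout)
print(f"baseline MSE={baseline_error:.2f}  linear MSE={line_error:.2f}")
```

If a more complex model cannot clearly beat the simpler one on held-out data, the simpler model is usually the better choice, because it is easier to explain and cheaper to run.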
- SAS has some fantastic introductory courses for statistics and machine learning on Coursera.
- You can get free access to the SAS Data Science Academy for 30 days.
- Professor Andrew Ng’s Machine Learning Yearning is a must-read for data science project planning.
- You can find more information about how we are deploying data science in the public sector here.