Data science in the wild: On the home stretch

“It doesn’t stop being magic just because you know how it works.”
Terry Pratchett, The Discworld Series

Welcome to the third, and final, installment of Data Science in the Wild. In Part 1 we were lost in the woods thinking about how to start a data science project. In Part 2 we found our way through the decision trees as we considered data and modelling strategies. And in this final part we’ve come out the other side, homeward bound, as we make some final considerations around AutoML, model interpretability and solving business problems.

AutoML

Someone once told me “a little information is a dangerous thing,” and it rings true when it comes to AutoML. Simply throwing your data at an AutoML algorithm can yield success, but it can also lead to spurious results and incorrect conclusions. This is dangerous when you are using predictive analytics to inform business decisions. It is important to establish how well the model is working, and whether it is a good and representative fit to the data.

I must admit, I do use AutoML regularly. It can save you a lot of time, and when it comes to fitting a predictive model, an awful lot of the grunt work can, and should, be scripted anyway. A good example is hyperparameter tuning since there is no obvious way to set optimal values without some trial and error.
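To make the "trial and error" point concrete, here is a minimal sketch of scripted hyperparameter tuning: a plain grid search over k for a toy k-nearest-neighbours regressor, scored on a holdout set. The data, the candidate values and every function name are assumptions made up for illustration; this is not how any particular AutoML product works internally, just the kind of grunt work that is easy to automate.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_mse(train, valid, k):
    """Mean squared error on the holdout set for a given k."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)


def grid_search_k(train, valid, candidates):
    """Try every candidate k and keep the one with the lowest holdout error."""
    scores = {k: validation_mse(train, valid, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic data (an assumption for this sketch): a noisy sine wave.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

# Hold out the last 50 points so tuning is judged on data the model has not seen.
train, valid = data[:150], data[150:]

candidates = [1, 3, 5, 10, 25, 50]
best_k, scores = grid_search_k(train, valid, candidates)

for k in candidates:
    print(f"k={k:>2}  holdout MSE={scores[k]:.4f}")
print("selected k:", best_k)
```

The search itself is mechanical, which is why it should be scripted. Choosing the candidate range, the holdout strategy and the metric is where the data scientist still does the guiding.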

In my experience, AutoML works best when it is guided by a data scientist. This applies whether you are using a GUI or an AutoML API through R or Python. APIs that give open source developers a platform to run AutoML from any client are particularly powerful. They make it easier to build models from your preferred IDE whilst continuing to use your favourite open source libraries for managing and visualizing data sets.

If you want to learn more about this, there’s a fantastic article here by Sophia Rowland where a SAS data scientist goes toe-to-toe with SAS Viya’s AutoML capabilities.

Model Interpretability

One of the points we briefly touched on in the last post was the trade-off between model explainability and performance.

One of the virtues of traditional models is that they can be interpreted by data scientists, explained clearly to the business, and easily audited.

When we use more complex machine learning techniques, such as deep learning and gradient boosted models, we have what is commonly referred to as a "black box" model. We know what goes into the model, and we get an output, but the bit in the middle is rather fuzzy. These models are difficult to audit or explain to the business. They can also carry bias if there is an imbalance in the training data, or if features are included that it may be unethical to use. This applies equally to both manually created and AutoML generated models.

What can you do to solve these problems? Defining performance metrics and thoroughly testing your model can go a long way towards making your models robust. You can also use explainability methods such as LIME for black box models to make them easier to interpret. It can also help to define clear metadata for your models, rather than just commenting in your model source code.
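To give a feel for what a local explainability method does, here is a simplified sketch in the spirit of LIME: perturb the inputs around one prediction, weight the perturbed samples by how close they are to the original point, and fit a weighted linear model whose coefficients act as the local explanation. This is not the `lime` library or its exact algorithm. The `black_box` function, the kernel choice and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def black_box(x1, x2):
    # Hypothetical stand-in for an opaque model's prediction function.
    return x1 ** 2 + 3.0 * x2


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def explain_locally(predict, point, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate around `point`.

    Returns (intercept, coefficients); each coefficient approximates how
    the prediction changes per unit change in that feature near `point`.
    """
    rng = random.Random(seed)
    n = len(point) + 1  # intercept plus one slope per feature
    # Accumulate the weighted normal equations (X^T W X) beta = X^T W y.
    xtwx = [[0.0] * n for _ in range(n)]
    xtwy = [0.0] * n
    for _ in range(n_samples):
        offsets = [rng.gauss(0.0, scale) for _ in point]
        sample = [p + d for p, d in zip(point, offsets)]
        y = predict(*sample)
        # Assumed kernel: weight decays exponentially with squared distance.
        dist2 = sum(d * d for d in offsets)
        w = math.exp(-dist2 / kernel_width ** 2)
        row = [1.0] + offsets
        for i in range(n):
            xtwy[i] += w * row[i] * y
            for j in range(n):
                xtwx[i][j] += w * row[i] * row[j]
    intercept, *coefs = solve_linear_system(xtwx, xtwy)
    return intercept, coefs


intercept, coefs = explain_locally(black_box, [1.0, 2.0])
print("local intercept:", round(intercept, 3))
print("local feature effects:", [round(c, 3) for c in coefs])
```

Around the point (1, 2) the surrogate's coefficients should sit close to the true local slopes of the toy function (2 for the first feature, 3 for the second). That is the kind of statement you can take back to the business: near this prediction, this feature matters roughly this much.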

(Actually) Solving a Business Problem

You may hear people say that algorithms are effectively a commodity now. What this really means is that the value lies less in the algorithm itself and more in how you use it as part of a wider end-to-end deployment. Having a classification or regression model is great, but that can already be done in many ways. The focus is really on how you make the model accessible to the business process and drive actionable insights. In other words, predictive models have their use – but need to be part of a wider analytics life cycle.

This is an important point because a data science project should be started to solve a business problem. It is therefore important to not only build a good model but to also get that model working for you.

An Approach for Success

Don’t avoid using AutoML but instead embrace it as an autopilot. An autopilot will help make your life easier, but it won’t replace the need for a seasoned captain to monitor and guide it.

Being able to explain your model is no longer a nice-to-have but an important part of creating a useful and trustworthy predictive model.

Finally, data science projects should exist to solve business problems. The business problem, therefore, drives the project and the choice of strategy – and not the other way around.

Conclusion

I hope this series has been an interesting read for you. In future posts, we’ll go into more technical depth on how you can scale your open source analytics with SAS Viya.

Learn More

SAS has some fantastic introductory courses for statistics and machine learning on Coursera.
You can get free access to the SAS Data Science Academy for 30 days.
Professor Andrew Ng’s ‘Machine Learning Yearning’ is a must-read for data science project planning.
You can find out more about how we are deploying data science in the public sector here.
