This is the fourth post in my series of 10 machine learning best practices.
It’s common to build models on historical training data and then apply them to new data to make decisions. This process is called model deployment or scoring. I often hear data scientists say, “It took my team weeks or months to deploy our model.” And sometimes, after all that hard work, a model never gets deployed at all.
Each model includes a lot of data preparation logic. You have to aggregate many data sources, include the model formulae, and layer on rules or policies. To summarize, a scoring model consists of data preparation + rules + model formulae + potentially more rules.
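To make that summary concrete, here is a minimal sketch of a scoring function built from those four parts. Everything in it is hypothetical and for illustration only: the field names (`balance`, `credit_limit`, `late_payments`), the coefficients, the 0.5 cutoff, and the decision labels are all assumptions, not taken from any real model.

```python
import math


def prepare(raw):
    """Data preparation: derive model features from raw source fields."""
    limit = raw["credit_limit"]
    return {
        # Utilization is undefined when there is no credit limit.
        "utilization": raw["balance"] / limit if limit > 0 else None,
        "late_payments": raw["late_payments"],
    }


def pre_rules(features):
    """Rules applied before the model: return a decision, or None to continue."""
    if features["utilization"] is None:
        return "refer"
    return None


def model_formula(features):
    """Model formula: a logistic regression with made-up coefficients."""
    z = -2.0 + 3.0 * features["utilization"] + 0.5 * features["late_payments"]
    return 1.0 / (1.0 + math.exp(-z))


def post_rules(probability):
    """More rules: turn the model output into a business decision."""
    return "decline" if probability >= 0.5 else "approve"


def score(raw):
    """Scoring = data preparation + rules + model formula + more rules."""
    features = prepare(raw)
    early_decision = pre_rules(features)
    if early_decision is not None:
        return early_decision
    return post_rules(model_formula(features))
```

Notice that the model formula is only one of four functions. Whoever deploys the model has to reproduce all of them, in the same order, to get the same decisions.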
The data preparation logic is essential for scoring. This data wrangling phase – which includes defining all of the transformation logic to create new features – is typically handled by a data scientist.
Then, to deploy the model you have to replicate the data wrangling phase. Often, IT completes this task to integrate the model into a company’s decision support systems. Unfortunately, most organizations don’t have enough rigor and metadata to re-create the data wrangling phase for scoring. As a result, many of the backward data source dependencies for deriving the new scoring tables get lost. This is by far the biggest reason why most organizations take too long to put a model to work. How can you avoid these frustrations?
To get models into production, implement best practices for managing predictive models in a production environment, including:
- Determine the business objective.
- Access and manage the data.
- Develop the model.
- Validate the model.
- Deploy the model.
- Monitor the model.
One tip is to use tools that automatically capture the data preparation logic, including preliminary transformations, and bind it to the model score code. The data engineer or IT staff member responsible for deploying the model then has a blueprint for implementation. They don’t have to piece the data engineering and algorithmic steps back together, which saves a huge amount of time. The data scientist should also participate in running initial scoring tests before the model goes into production.
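As a rough illustration of what "binding" the preparation logic to the score code can look like, the sketch below stores the transformation steps and the model coefficients in one JSON artifact. This is not how any particular tool does it. The names `TRANSFORMS`, `export_bundle`, `score_with_bundle`, and `required_sources`, along with the columns and coefficients, are all hypothetical.

```python
import json
import math

# Hypothetical catalog of named transformations shared by training and scoring.
TRANSFORMS = {
    "log1p": lambda x: math.log1p(x),
    "ratio": lambda a, b: a / b if b else 0.0,
}


def export_bundle(steps, coefficients, intercept):
    """Serialize the preparation steps and model formula as one artifact."""
    return json.dumps(
        {"steps": steps, "coefficients": coefficients, "intercept": intercept}
    )


def score_with_bundle(bundle_json, raw):
    """Replay the recorded preparation steps, then apply the model formula."""
    bundle = json.loads(bundle_json)
    row = dict(raw)
    for step in bundle["steps"]:
        args = [row[name] for name in step["inputs"]]
        row[step["output"]] = TRANSFORMS[step["op"]](*args)
    z = bundle["intercept"] + sum(
        weight * row[name] for name, weight in bundle["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def required_sources(bundle_json):
    """List the raw columns the bundle depends on, in first-use order."""
    bundle = json.loads(bundle_json)
    derived, sources = set(), []
    used = [name for step in bundle["steps"] for name in step["inputs"]]
    for step in bundle["steps"]:
        derived.add(step["output"])
    for name in used + list(bundle["coefficients"]):
        if name not in derived and name not in sources:
            sources.append(name)
    return sources


# Example artifact with made-up steps and coefficients.
EXAMPLE_BUNDLE = export_bundle(
    steps=[
        {"op": "log1p", "inputs": ["income"], "output": "log_income"},
        {"op": "ratio", "inputs": ["balance", "credit_limit"], "output": "utilization"},
    ],
    coefficients={"log_income": -0.3, "utilization": 2.0},
    intercept=0.5,
)
```

With an artifact like this, the person deploying the model can call `required_sources(EXAMPLE_BUNDLE)` to see which raw columns must be available at scoring time, rather than reverse engineering them from notebooks.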
Advanced enterprises are also creating standardized analytical data marts to foster replication and reuse of data for analytics. Data scientists and IT can then work collaboratively to harvest directly from these analytical data marts to build and deploy more models faster. You can tune this process to the point where some common modeling efforts run like a model factory. It is important that the data scientist contribute new feature logic to the data marts. You don’t want to handcuff your data scientists by not letting them pioneer new features at the detailed data source levels.
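One lightweight way to picture a data mart that data scientists can contribute to is a shared feature registry. The sketch below is only an assumption about how such a registry might look. The `feature` decorator, the `FEATURES` dictionary, and the customer fields are hypothetical names.

```python
# Hypothetical shared registry: feature name -> function that computes it.
FEATURES = {}


def feature(name):
    """Decorator that registers a feature function under a unique name."""
    def register(fn):
        if name in FEATURES:
            raise ValueError(f"feature {name!r} is already registered")
        FEATURES[name] = fn
        return fn
    return register


@feature("order_count")
def order_count(customer):
    return len(customer["orders"])


@feature("avg_order_value")
def avg_order_value(customer):
    # A new feature contributed from detailed, order-level source data.
    orders = customer["orders"]
    return sum(orders) / len(orders) if orders else 0.0


def build_mart_row(customer):
    """Compute every registered feature for one customer record."""
    row = {"customer_id": customer["customer_id"]}
    for name, fn in FEATURES.items():
        row[name] = fn(customer)
    return row
```

Because training and scoring both read rows built by `build_mart_row`, a feature added once is available to every later model without being reimplemented.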
If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here. My next post will be about autotuning. If you missed any previous posts, you can catch up on the rest of the series.