In the third post of the 10 Commandments of Applied Econometrics series we discussed the issue of data exploration. Today, I will present the next commandments: keep the models sensibly simple and apply the interocular trauma test.
- Keep the models sensibly simple
Striking the right balance between simplicity and sophistication: a model should be neither too complex nor oversimplified. Overly simple models risk logical errors or incompatibility with the data. They may fail to capture the correlations and interdependencies between variables accurately, so the available information is not used efficiently. In that case the forecasts produced by the model will be of low quality and the client will not be satisfied.
However, overly complicated models have many pitfalls too. If the specification is too rich – too many variables are used – the model may become overly specific to the estimation dataset. As a consequence of this overfitting, forecasts for that particular dataset will look very good, but forecasts for new items or periods (out-of-sample forecasts) will be unstable and of poor quality. More complex models often require more data and more advanced tools, which raises costs. They are also more sensitive to errors and data inconsistencies. Finally, it is harder to interpret their results and to explain the model's logic to a client who is not willing to base business decisions on forecasts provided by a "black box" without knowing how it operates.
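To make the overfitting risk concrete, here is a minimal sketch in Python with scikit-learn (not part of the original post; the data are simulated and the degree-15 specification is just an illustrative stand-in for an overly rich model). It compares in-sample and out-of-sample fit of a sensibly simple model against an over-specified one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Simulated data: one true linear driver plus noise
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=3.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # sensibly simple vs. overly rich specification
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"in-sample R^2={r2_score(y_train, model.predict(X_train)):.3f}  "
          f"out-of-sample R^2={r2_score(y_test, model.predict(X_test)):.3f}")
```

Typically the over-specified model wins in-sample but loses out-of-sample – exactly the instability described above.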
In the practice of implementing analytic models, much attention is paid to their interpretability and to understanding how they operate. This is why so-called "white box" models are preferred and "black boxes" are avoided. Clients often choose – even at the expense of lower predictive power – simple and proven regression models, decision trees or exponential smoothing models instead of less stable and harder-to-interpret artificial neural networks, random forests or ensemble models.

When we create models, we need not only to choose the functional form (method) but also to decide on the selection of variables. The theory of econometrics offers two basic approaches to variable selection: bottom-up (forward selection) and top-down (general-to-specific, or backward selection). In practice they often lead to very similar results. Alternative approaches to variable selection – known as shrinkage methods – include ridge regression, LASSO and ElasticNet. They are explained and implemented in the LASSO Selection with PROC GLMSELECT demo; a rough sketch follows below. Experience shows that the best results are achieved with sensibly simple methods – simple but not naive. More on this topic can be found in the first post of the 10 Commandments of Applied Econometrics series.
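The demo referenced above uses SAS's PROC GLMSELECT. As an illustration only, here is a rough Python equivalent of LASSO variable selection using scikit-learn's LassoCV – the dataset and variable names are invented for the example:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Simulated dataset: 10 candidate regressors, only 3 truly matter
n, p = 500, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_beta + rng.normal(scale=1.0, size=n)

# LASSO with the penalty strength chosen by cross-validation;
# standardising first so the penalty treats all variables equally
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
selected = [f"x{i}" for i, c in enumerate(coefs) if abs(c) > 1e-8]
print("Selected variables:", selected)
```

The shrinkage penalty drives the coefficients of irrelevant variables to exactly zero, so the selection is a by-product of the estimation itself.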
- Apply the interocular trauma test
Today, in the era of continuously developing technologies and tools, researchers often produce very sophisticated results based on different models and sets of variables. It is advisable to take some time and check whether those results simply make sense. Are they logical? Are the signs of the parameter estimates (the direction of the effect of each explanatory variable on the dependent variable) consistent with expectations?
One approach to this problem is the interocular trauma test, also known as the stupidity test. The idea is to look hard at the results until the answer to the above questions becomes obvious – until it hits you between the eyes. This subjective procedure, however, should be carried out independently of the formal testing of model quality and cannot replace it. A simple automated sign check can support it, as sketched below.
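As an illustration of such a sign check, here is a minimal Python sketch (hypothetical demand model; the variable names and expected signs are assumptions for the example, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical demand model: sales explained by price and advertising spend
n = 300
price = rng.uniform(5, 15, n)
advertising = rng.uniform(0, 10, n)
sales = 100 - 4.0 * price + 2.5 * advertising + rng.normal(scale=5.0, size=n)

X = np.column_stack([price, advertising])
fit = LinearRegression().fit(X, sales)

# Economic intuition: higher price lowers sales (-), advertising raises it (+)
expected_signs = {"price": -1, "advertising": +1}
for name, coef in zip(expected_signs, fit.coef_):
    ok = np.sign(coef) == expected_signs[name]
    print(f"{name:12s} coef={coef:+.2f}  expected sign "
          f"{'matches' if ok else 'CONTRADICTS'} expectations")
```

A contradicted sign does not automatically mean the model is wrong, but it is exactly the kind of result that should be stared at until it makes sense.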
Sometimes the results and findings of a data analysis are surprising and contrary to conventional wisdom. If this is the case, it is advisable to consult an expert – for example the end user of the analysis. Then one of two scenarios may materialise. The first, and most common, is that a programming error, a wrong assumption or a misunderstanding in the analysis is detected; the model can then be reviewed and improved. The second possible scenario is that the surprising result is positively validated and can be considered true. This means we have just discovered new knowledge about the investigated phenomenon, which can then be put to operational use.
In the next post of the 10 Commandments of Applied Econometrics series, to be published soon, we will discuss another important issue: benefits and costs of using data mining techniques. Those interested are encouraged to read the original article by Peter E. Kennedy.