In the fourth post of the 10 Commandments of Applied Econometrics series we discussed the issues of keeping the solutions sensibly simple and applying model validation. Today, I will present another commandment related to data mining techniques.
- Use data mining reasonably.
In the econometric community, data mining is a controversial and highly emotional topic. Until recently, theoretical econometricians considered this approach one of the most serious sins committed by analysts and others who use econometric techniques in practice. Data mining supporters, on the other hand, pointed to its unquestionable benefits and advantages, particularly in processing large volumes of data. So what is data mining? And why are opinions on it so polarised?
Generally, data mining is a process of analysing data from different perspectives with the use of statistical and econometric methods. Its aim is to uncover characteristics of the data and the links between different variables. There are two views on data mining. The first, more heavily criticised view treats it as experimenting with (or fishing through) the data. The second positions it as an important component of data analysis. Unfortunately, these two approaches are not fully distinct, which results in inconsistent views on the subject.
In the context of data mining, it is also advisable to consider the concept of machine learning defined as methods of automated model building used in data analysis processes. Both data mining and machine learning are becoming increasingly popular, thanks to more advanced and powerful technologies that allow for working with Big Data.
Data mining undertaken as an experiment with data, in order to discover empirical regularities that throw new light on economic theory, is often combined with Exploratory Data Analysis. The process, aimed at finding interesting or valuable information in Big Data, may help to identify errors made during the development of the theoretical specification. Applied econometrics is the art of distinguishing valuable data-based theories from regularities that are not worth consideration. The results must therefore be validated against the theory underlying the investigated phenomenon to minimise the cost of using data mining. Such validation helps to avoid a situation where a researcher “discovers” new regularities in a dataset with specific characteristics and then generalises them, unjustifiably, to the whole population. In general, the final specification of a good model should combine economic theory and business knowledge with the interdependencies discovered in the data analysis. It is also important that the resulting model be, as far as possible, economical, reliable and informative.
Data mining as a part of the data analysis process can be used to automatically identify the best model specification. However, using the data for this purpose has one major drawback. The resulting functional form, the selected variables and their significance will depend heavily on the specific dataset on which the researcher works. This may lead to a wrong understanding of the data-generating process or of the factors affecting the target variable. Furthermore, traditional specification tests will not provide reliable results in this case. Once a dataset has been used to develop the specification, it cannot be used again as the basis for testing the adequacy of that specification (the result of such a test would be biased).
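The bias that comes from testing a specification on the data that produced it can be illustrated with a small simulation. The sketch below is not part of the original argument and rests on stated assumptions: the target and all 30 candidate predictors are pure noise, the "specification search" simply keeps the predictor most correlated with the target on a training half, and the same correlation is then measured on a holdout half. All names and settings (`pearson`, `one_replication`, `N_CANDIDATES`, the sample sizes) are illustrative.

```python
import random
from math import sqrt

N_TRAIN, N_TEST = 40, 40      # assumed sample sizes
N_CANDIDATES = 30             # candidate predictors, all unrelated to y
N_REPLICATIONS = 200


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def one_replication(rng):
    """Pick the 'best' noise predictor on the training half, then re-check it on the holdout half."""
    n = N_TRAIN + N_TEST
    y = [rng.gauss(0, 1) for _ in range(n)]
    candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(N_CANDIDATES)]
    # Specification search: keep the candidate with the largest |correlation| in-sample.
    best = max(candidates, key=lambda x: abs(pearson(x[:N_TRAIN], y[:N_TRAIN])))
    in_sample = abs(pearson(best[:N_TRAIN], y[:N_TRAIN]))
    out_of_sample = abs(pearson(best[N_TRAIN:], y[N_TRAIN:]))
    return in_sample, out_of_sample


rng = random.Random(0)
results = [one_replication(rng) for _ in range(N_REPLICATIONS)]
mean_in = sum(r[0] for r in results) / N_REPLICATIONS
mean_out = sum(r[1] for r in results) / N_REPLICATIONS
print(f"mean |corr| of selected predictor, training half: {mean_in:.2f}")
print(f"mean |corr| of selected predictor, holdout half:  {mean_out:.2f}")
```

Because every candidate is noise, any relationship found in the training half is spurious. The training-half figure should come out noticeably larger than the holdout figure, since it is inflated by the search itself.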
The challenges of the data mining approach also include the instability of the resulting models. Decision trees are a typical example. These models are highly sensitive to changes in the data: even a minor change in the training dataset can mean that different variables are selected in the tree and, consequently, that the results differ as well. Take, for example, a decision tree used by a bank in a customer scoring system where customers are ranked according to the probability of loan repayment. The score is calculated on the basis of the characteristics selected as relevant in this model. If we change the training dataset, by excluding some customers from the analysis or by adding or removing explanatory variables, the tree may indicate completely different relevant variables. Consequently, the score for a customer can change even if their attributes have not changed at all. Of course, such models are unacceptable in a production environment. To minimise the instability effect, several models are often estimated and their forecasts averaged. Alternatively, more complex models can be built, for example random forests that combine tens or even hundreds of single tree models.
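The instability described above, and the effect of averaging, can be sketched with a deliberately tiny model. The example below is an illustration under stated assumptions, not a production scoring system: the "tree" is a single split (a decision stump), the customer data are synthetic, and the attribute names `income`, `savings` and `age` are hypothetical. The stump is refitted on bootstrap resamples of the same customers, and the sketch records which attribute each refit selects.

```python
import random
from collections import Counter

FEATURES = ["income", "savings", "age"]   # hypothetical customer attributes
N_CUSTOMERS, N_RESAMPLES = 100, 30


def make_customers(rng, n):
    """Synthetic data: savings closely tracks income; age is unrelated to repayment."""
    rows, repaid = [], []
    for _ in range(n):
        income = rng.gauss(0, 1)
        savings = income + rng.gauss(0, 0.3)
        age = rng.gauss(0, 1)
        rows.append((income, savings, age))
        repaid.append(1 if income + savings + rng.gauss(0, 1) > 0 else 0)
    return rows, repaid


def fit_stump(rows, labels):
    """One-split tree: pick the (feature, threshold) with the fewest training errors
    for the rule 'predict repayment if the feature value exceeds the threshold'."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            errors = sum((r[j] > t) != (y == 1) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def predict(stump, row):
    feature, threshold = stump
    return 1 if row[feature] > threshold else 0


rng = random.Random(42)
rows, repaid = make_customers(rng, N_CUSTOMERS)

stumps = []
for _ in range(N_RESAMPLES):
    # Bootstrap: redraw the training set with replacement, then refit.
    idx = [rng.randrange(N_CUSTOMERS) for _ in range(N_CUSTOMERS)]
    stumps.append(fit_stump([rows[i] for i in idx], [repaid[i] for i in idx]))

root_counts = Counter(FEATURES[feature] for feature, _ in stumps)
print("attribute chosen for the split, by resample:", dict(root_counts))

# One unchanged customer, scored by each tree and by the average over all trees.
customer = (0.1, -0.1, 0.0)
votes = [predict(s, customer) for s in stumps]
averaged_score = sum(votes) / len(votes)
print("distinct single-tree decisions:", sorted(set(votes)))
print("averaged score:", averaged_score)
```

Averaging the votes of trees fitted on resampled data is the idea behind bagging. Random forests build on it by also randomising which variables each tree may use.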
So what should we focus on, and how can we benefit from data mining techniques?
- Model specification should not be a result of blind belief in the accuracy of testing procedures but a well-thought-out combination of theories and results obtained from the data themselves.
- Avoid assessing models on the basis of controversial criteria such as maximising the coefficient of determination (R²) on the training dataset, because its value does not decrease as new explanatory variables are added, even if these variables have no significant effect on the examined phenomenon.
- Variable relevance and adequacy tests should be designed so as to minimise costs related to the use of data mining processes.
- Dividing the available sample into a training dataset (for model estimation) and a test dataset (for specification testing) avoids building and testing a model on the same observations.
- A similar effect can be achieved by using a completely different dataset for the forecast. This method is known as out-of-sample testing. For example, we can build a model on data for the years 2010-2014 and test the forecast quality on newer data for 2015.
- A correction can also be applied during the process itself by adjusting the significance level for the multiple testing problem, for example with the Bonferroni correction, which is also implemented in SAS.
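The Bonferroni adjustment mentioned in the last point can be sketched in a few lines. The example uses plain Python rather than SAS, and the function names (`bonferroni_reject`, `family_wise_error_rates`) and simulation settings are assumptions made for illustration. It compares how often at least one of 20 true null hypotheses is wrongly rejected with and without the correction, relying on the fact that p-values of continuous test statistics are uniformly distributed when the null is true.

```python
import random


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for test i only if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hand-checkable case: with 3 tests the per-test threshold is 0.05 / 3, about 0.0167.
decisions = bonferroni_reject([0.001, 0.02, 0.04])


def family_wise_error_rates(n_tests=20, alpha=0.05, n_replications=2000, seed=1):
    """Share of replications with at least one false rejection when every null is true."""
    rng = random.Random(seed)
    raw = corrected = 0
    for _ in range(n_replications):
        # Under a true null, a p-value is uniform on (0, 1).
        p_values = [rng.random() for _ in range(n_tests)]
        raw += any(p <= alpha for p in p_values)
        corrected += any(bonferroni_reject(p_values, alpha))
    return raw / n_replications, corrected / n_replications


fwer_raw, fwer_bonferroni = family_wise_error_rates()
print(decisions)
print(f"uncorrected: {fwer_raw:.3f}, Bonferroni: {fwer_bonferroni:.3f}")
```

With 20 independent tests at the 5% level, the theoretical chance of at least one false rejection without correction is 1 - 0.95^20, roughly 0.64. The correction keeps it at or below 5%, at the price of lower power for each individual test.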
Today, the data mining dispute can be equated with the dispute about the source of model hypotheses and specifications. Data mining explores the data in order to develop the model specification; in econometrics, the specification is based mainly on economic theory and is often known in advance. In practice, these two approaches are often combined – data mining becomes a source of business hypotheses while statistics and econometrics are used to formally validate and operationalise the solutions.
Let me conclude this post with a short summary of the key temptations awaiting the apprentices of the art of modelling who use machine learning and data mining tools:
Temptation to use tools and results in a quick and uncritical way – With easy access to tools that enable quick modelling, researchers may be tempted to use the results immediately in business. This is not recommended. First, the method and the data should be analysed, and the assumptions and approach validated. In this way we can check whether the line of reasoning is correct, detect any errors and, consequently, gain more insight from the interpretation of results.
Temptation to generalise the results obtained beyond the modelled sample – If the sample is not representative, the interpretation should be limited to the data on which the model is based. The available data should therefore be reviewed for representativeness and sample randomness to minimise the risk of drawing wrong conclusions from the analysis. The problem of reject inference in credit scoring is a good example here. Read more in the second post of the 10 Commandments of Applied Econometrics series.
Temptation to neglect model instability – Models with specifications based on dependencies in empirical data are characterised by instability, which reflects the stochasticity and complexity of the world. Dependencies present in one dataset need not exist in another, so a model built on a different sample will probably be different. This is unavoidable, but you should be aware of it and make a competent business decision: is the instability acceptable, or do you prefer more stable and general rules at the expense of, for example, lower forecast accuracy?
In the next post of the 10 Commandments of Applied Econometrics series, to be published soon, we will discuss another two important issues: being prepared to compromise when implementing solutions based on the theory of statistics and being mindful in the interpretation of statistical significance of variables. Those interested are encouraged to read the original article by Peter E. Kennedy.