In the fifth post in this series, we discussed the issues involved in the use of data mining and machine learning techniques. Today, I will present the remaining commandments: be prepared to compromise when implementing solutions based on statistical theory, be mindful when interpreting the statistical significance of variables, and present the path that led to the final model.
- Be prepared to compromise.
Being prepared to compromise is, in general, about being aware of the limitations of econometric theory in applied work and being able to handle them. The theory delivers standard solutions to standard problems, but in the real world we deal with non-standard problems: each of them is different and specific, while the available standard solutions remain unchanged. The challenge in applied econometrics is therefore to modify the standard solutions in such a way that a non-standard problem can be solved.
A researcher must be prepared to make decisions that usually boil down to selecting the lesser evil; these are trade-off decisions. We are often forced to find the golden mean between the complexity of a model and the quality of the forecasts it delivers, between the accuracy of the results and the processing time, or between the speed of the method and the hardware resources it requires. A researcher must also answer other questions, such as: “Is the sample bias negligible?” or “Can I trust the results of statistical tests in this specific case, even though not all theoretical assumptions are satisfied?”. A good practising econometrician must have extensive knowledge of statistical and econometric theory and must understand the methods in use, so as to apply them purposefully, not mindlessly. The first of these trade-offs, model complexity versus forecast quality, is sketched below.
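To make the complexity-versus-forecast-quality trade-off concrete, here is a minimal sketch, assuming Python with scikit-learn (the data and models are purely illustrative, not taken from any real study): a more flexible model fits the sample more closely, but its cross-validated forecast error need not be lower.

```python
# A minimal sketch of the complexity-vs-forecast-quality trade-off
# on synthetic data (all settings here are illustrative assumptions).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy signal

for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validated MSE: more complexity does not always
    # translate into better out-of-sample forecasts.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  cv_mse={mse:.3f}")
```

The point is that the out-of-sample error, not the in-sample fit, is what the trade-off decision should be based on.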
- Do not confuse statistical significance with the revealed truth.
The statistical significance of a variable is a characteristic calculated for a specific dataset by determining the value of the test statistic and the corresponding p-value (probability value). The p-value can be equated with the probability of making a type I error, i.e., rejecting the null hypothesis although it is actually true. Whether a variable is considered significant in a model depends not only on these values but also on the significance level assumed by the analyst (traditionally 0.05), which describes the highest accepted probability of making a type I error. This makes significance a measure that can be easily manipulated. Moreover, it is based on a series of assumptions about the distributions of the variables. This is why the approach in which significance tests are used to “sanctify” and confirm a theory is heavily criticised as overused and inadequate.
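A minimal sketch of why this matters, assuming Python with statsmodels (the data is synthetic and the effect size is chosen for illustration): the very same small effect is “insignificant” in a small sample and “significant” in a large one, so a p-value below 0.05 says little about the magnitude or importance of the effect.

```python
# The same tiny true effect (0.03) crosses the 0.05 threshold only
# once the sample is large enough: significance reflects sample size
# and the chosen alpha, not the effect's substantive importance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
for n in (100, 10_000):
    x = rng.normal(size=n)
    y = 0.03 * x + rng.normal(size=n)   # small but real effect
    model = sm.OLS(y, sm.add_constant(x)).fit()
    p = model.pvalues[1]
    print(f"n={n:6d}  coef={model.params[1]:+.4f}  p={p:.4f}  "
          f"significant at 0.05: {p < 0.05}")
```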
Practice shows that statistical significance alone is not enough to obtain a good-quality model. In econometrics, the success factors include: a good dataset, shrewdness and critical assessment of the results, common sense, knowledge of the theoretical basics, logical reasoning, a historical perspective, business knowledge, and proficiency and experience in applying the methods. Statistical significance can therefore be one of the factors considered when selecting variables for a model, but it should not be equated with an actual relation or relied upon as the single, decisive criterion. It is increasingly suggested that, instead of testing significance, one should test the closeness of the estimated parameter value to its actual value, and that the significance level should be linked to the size of the sample on which the model is estimated. Moreover, it is recommended to make comparisons against other models explaining the investigated phenomenon and to analyse the results with common sense. It is also possible to use methods of automatic variable selection such as backward, forward and stepwise selection, or LASSO and LAR regression (described in more detail in the fourth post of the 10 Commandments of Applied Econometrics); a short example follows below.
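As a minimal sketch of such automatic selection, assuming Python with scikit-learn and synthetic data (LassoCV picks the penalty by cross-validation, which is one reasonable choice among several): the L1 penalty shrinks the coefficients of irrelevant regressors to exactly zero, performing selection as a by-product of estimation.

```python
# LASSO as an automatic variable-selection device on synthetic data:
# only 3 of 20 regressors truly enter the data-generating process.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X, y)          # penalty chosen by cross-validation
selected = np.flatnonzero(lasso.coef_ != 0)
print("selected variables:", selected)   # ideally close to [0, 1, 2]
```

Even here, the selected set should be reviewed with business knowledge and common sense rather than accepted mechanically.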
- Present the path that led you to the final model.
It is common practice, when discussing and presenting the results of analyses or models, to pass over some aspects or steps of the research process, deliberately or not. This is usually a consequence of limited time, limited text length, or of considering some details insignificant or irrelevant. However, such an approach may lead to an erroneous perception of the adopted methodology and, consequently, of the stability of the results and conclusions. In such a situation, it is difficult for the customer to assess the level of uncertainty that should be attached to the content presented to them.
Data mining, data processing and modelling are not about following a single, pre-defined, automatic path. Each case is different and must be analysed and approached individually. Moreover, as of today, the human factor cannot be excluded: while the model itself can be built and “trained” automatically with dedicated algorithms and processes, the preparation of data, the selection of the best specification and the business analysis reflect the analyst’s knowledge and experience. A small amount of subjectivity is therefore unavoidable in the creation of a good and comprehensive solution. When the same dataset is given to two analysts, it is very unlikely that they will obtain exactly the same results and conclusions, because different methods of estimation, imputation of missing data, sampling and modelling can be applied, and solutions can be assessed on different aspects and against different criteria (see the sketch below). There is nothing wrong with this: it creates the opportunity to implement innovative ideas, follow new paths and, consequently, uncover new and interesting relations.

However, it is not an issue that can be neglected. The whole methodology used in the research should be described, so that the customer can decide, individually and precisely, whether they agree with the adopted approach and, consequently, how reliable the presented results are for them. Of course, it is impossible to describe the whole model-creation process in all its complexity, and in the era of data mining solutions it is also difficult to demonstrate all intermediate model versions. But in addition to the final models, it is advisable to present the path that led to them. With this information, the customer will be able to critically assess the presented results, and there is no good data analysis without some criticism!
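To illustrate how such analyst choices matter, here is a minimal sketch, assuming Python with scikit-learn (the data is synthetic and the two strategies stand in for any pair of defensible choices): two analysts who impute missing values differently, one with the mean and one with the median, obtain noticeably different coefficient estimates from the same dataset.

```python
# Two defensible imputation choices, two different estimated slopes:
# a toy illustration of why the path to the model should be reported.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 300
x = rng.lognormal(size=n)                  # skewed regressor
y = 1.5 * x + rng.normal(size=n)
x_missing = x.copy()
x_missing[rng.random(n) < 0.3] = np.nan    # 30% of values go missing

for strategy in ("mean", "median"):
    x_imp = SimpleImputer(strategy=strategy).fit_transform(
        x_missing.reshape(-1, 1))
    coef = LinearRegression().fit(x_imp, y).coef_[0]
    print(f"{strategy:6s} imputation -> estimated slope: {coef:.3f}")
```

Neither choice is wrong, which is exactly why the customer needs to know which one was made.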
This is the last part of the 10 Commandments of Applied Econometrics series. Those interested are encouraged to read the original article by Peter E. Kennedy.