The first post of the 10 Commandments of Applied Econometrics series discussed the importance of using common sense and understanding econometric theory in data analysis. Today, I will present the next two commandments, both concerned with placing statistical tools in the business context of a problem.
2. Avoid type III errors.
In other words: ask relevant questions. A type III error occurs when an analysis produces the right answer to the wrong question. The concept was introduced by A. W. Kimball in his 1957 article "Errors of the Third Kind in Statistical Consulting", in which he shared his experience as a statistical consultant. A corollary of this rule is that an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question. We should always keep the purpose of the analysis in mind, because without it we cannot achieve the expected results. A good way to avoid type III errors is to ask many questions, even when the answers seem obvious. As they say, "Better to ask the way than go astray". So it pays to make the extra effort and ensure that we fully understand the purpose of the analysis and the methods we are going to use.
The definition of the target variable in predictive modelling is a good example. In most textbook examples the dependent variable is defined in advance. In practice, a researcher building a predictive model must develop, together with the forecast recipients, a precise formula for calculating the target variable. It is not always obvious what is to be forecast. Take, for example, an analysis of customer churn for a mobile operator. To perform such an analysis, the customers who left the operator must first be identified. For subscription customers it is easy to check whether the contract has been renewed, but for prepaid customers the status is not always clear, because those customers are not required to formally cancel their number; they simply stop using it. There is also a group of customers who neither renew nor terminate their contracts. Which category do they fall into? When can we consider that a customer has stopped using his or her phone? After a month of inactivity? Or perhaps after three months? Should we count outgoing calls only, or do incoming calls matter as well? To perform a reliable churn analysis, a number of such details need to be defined precisely. Analysts, together with the forecast recipients, must establish an operational, computable definition of an inactive customer. It is quite likely that such a definition already exists in the marketing department; if so, only the data needed to compute it must be located. A similar problem arises in loyalty programmes run by retail stores, pharmacies or filling stations. Does "no purchases for 3 months" mean that the customer is inactive? The answer depends on the nature of sales, the industry and the expectations of the business customers.
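To make this concrete, here is a minimal sketch of one possible operational churn definition for prepaid customers. The call-detail records, column names, 90-day window and the decision to count only outgoing calls are illustrative assumptions to be agreed with the business, not a universal rule:

```python
import pandas as pd

# Hypothetical call-detail records: one row per call made or received by a customer.
calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "call_date": pd.to_datetime(
        ["2024-01-05", "2024-03-20", "2023-11-02", "2024-02-14", "2024-04-01"]
    ),
    "direction": ["outgoing", "outgoing", "incoming", "outgoing", "incoming"],
})

snapshot_date = pd.Timestamp("2024-04-30")
inactivity_window = pd.Timedelta(days=90)  # agreed with the business, not a given

# Operational definition: a prepaid customer is "churned" if they made
# no outgoing call in the 90 days before the snapshot date.
last_outgoing = (
    calls[calls["direction"] == "outgoing"]
    .groupby("customer_id")["call_date"]
    .max()
)

churn_flag = (snapshot_date - last_outgoing) > inactivity_window
# Customers with no outgoing calls at all are flagged as churned.
churn_flag = churn_flag.reindex(calls["customer_id"].unique(), fill_value=True)
print(churn_flag)
```

Changing the window to 30 days, or counting incoming calls as activity, would label a different set of customers as churned, which is exactly why the definition has to be settled with the forecast recipients before any modelling starts.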
A wrong definition of the target variable, one that has not been agreed with the business, leads to models that answer the wrong questions. Such a model may have excellent goodness of fit and provide accurate forecasts, but for a different phenomenon than the one originally intended. As a result, the business benefit of the forecast may be unsatisfactory, because the model answers a different question, possibly one that is irrelevant to the forecast recipient. A similar issue with the definition of the target variable arises in demand forecasting. Actual demand, defined as the propensity to purchase a product at a certain price, cannot be observed directly in practice. This is why we try to estimate it from sales figures, orders received or the movement of goods in warehouses. Defining the target variable precisely (in this example, the demand) requires understanding how the data are collected and establishing which combination of available variables best reflects the historical, unobservable demand.
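As an illustration, here is one deliberately simple way to build a demand proxy from sales and stock-availability data. The column names, the treatment of stock-out days as censored observations and the interpolation step are assumptions to be discussed with the business, not a recommended recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data for one product: units sold and whether the item was in stock.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "units_sold": [12, 15, 3, 0, 14, 13],
    "in_stock": [True, True, False, False, True, True],
})

# One candidate target definition: treat sales on fully stocked days as a proxy
# for demand, and mark stock-out days as censored rather than as "zero demand".
df["demand_proxy"] = np.where(df["in_stock"], df["units_sold"], np.nan)

# A simple (and debatable) choice: linearly interpolate the censored days
# between neighbouring in-stock days before fitting a forecast model.
df["demand_proxy"] = df["demand_proxy"].interpolate(limit_direction="both")
print(df)
```

Whether stock-out days should be interpolated, modelled explicitly as censored, or supplemented with order data is precisely the kind of question that must be answered before the "demand" variable is handed to a forecasting model.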
3. Know the context.
In other words: do not perform a statistical analysis in ignorance of the subject matter. This commandment is inextricably linked with the previous one. Before asking the right question, the one the analysis is meant to answer, the researcher must become intimately familiar with the phenomenon under investigation and review the literature. It is also important to understand how the data were acquired or how the individuals were selected for the analysis. The sample selection method is key to interpreting the results and determines whether the findings can be generalised.
Let's take the example of a bank that processes loan applications. Based on the information provided by customers, it assesses the probability that a loan will be repaid and decides whether to grant it. The bank uses a scoring model fed with data on loans granted in the past and information on whether customers repaid them. We can expect that students, as young people without a long credit history, a regular job or significant income, will seldom earn a good credit score. Those who do get loans will therefore usually have some specific attributes that improve their creditworthiness; they form a non-random group of students for whom the probability of repayment is high. If we then build a scoring model on the sample of customers who were granted loans, being a student should have a positive effect on the probability that the loan will be repaid. If we do not know the context of the model and the methodology used for collecting the data, this result seems counter-intuitive or even illogical, because it can be read as: students usually repay their loans, so it pays to grant loans to them. If we know the context, however, we also know that the model sample is non-random and biased, because only those customers (including that specific group of students) who were granted loans are analysed. We have no information on the customers whose applications were rejected, so we cannot generalise the results to the entire population of students. The problem described above is known as reject inference, and ways of addressing it with the SAS Enterprise Miner tool have been presented in a SAS Global Forum 2010 article and a video by Miguel Maldonado.
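A small simulation makes the mechanism visible. The numbers below are made up and the code is only a sketch of the selection effect, not the bank's actual scoring model: in the full population students repay less often, yet among approved customers the relationship reverses, because only unusually creditworthy students pass the stricter approval rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population of loan applicants.
student = rng.random(n) < 0.3        # 30% are students
quality = rng.standard_normal(n)     # unobserved creditworthiness

# True repayment process: students are, on average, worse risks.
p_repay = 1 / (1 + np.exp(-(1.0 + 1.5 * quality - 1.0 * student)))
repaid = rng.random(n) < p_repay

# Bank approval rule: stricter for students, so only high-quality students get loans.
approved = np.where(student, quality > 1.0, quality > -0.5)

df = pd.DataFrame({"student": student, "repaid": repaid, "approved": approved})

print("Repayment rate in the full population:")
print(df.groupby("student")["repaid"].mean())          # students repay less often

print("\nRepayment rate among approved customers only:")
print(df[df["approved"]].groupby("student")["repaid"].mean())  # ordering reverses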
In the next post of the 10 Commandments of Applied Econometrics series, we will discuss another important issue: exploring and inspecting the data. Those interested are encouraged to read the original article by Peter E. Kennedy.