In the second post of the 10 Commandments of Applied Econometrics series we discussed embedding statistical tools in the context of business problems. Today, I will present another commandment, which concerns exploring and inspecting the data.
- Inspect the data.
An in-depth examination of the problem's context is a critical part of the analytical process. However, before you start modelling, you must not forget another important step: exploring the specific data set you will work with. In the era of the digital revolution, when tools are available that can estimate models with one click, researchers are increasingly accused of paying too little attention to data quality and of skipping an in-depth inspection of the data. So what should be done?
The data exploration process consists of three components: computing summary statistics, creating graphs and cleaning the data. Complicated statistics are rarely needed to inspect data thoroughly. Very often, simple statistics are enough: the mean, median, standard deviation, minimum and maximum values, or a correlation matrix. In addition to these statistics, graphs can be helpful as well, for example histograms, box plots or residual plots. Such visualizations may reveal surprising relations that would otherwise go unnoticed. Data cleaning, or the elimination of inconsistencies, is another important step. Here we focus on the values of individual variables and try to identify those that seem unrealistic or suspicious: the outliers. Missing data should be taken care of as well. We should examine how much data is missing and how missing values are coded, and then decide how to handle them in the analysis. We can create an additional category of the variable for them, impute (replace) the missing values, or exclude observations with missing values from the modelling process.
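The checks described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a complete cleaning routine: the `incomes` values are made up, coding a missing answer as `None` is an assumption, and the 1.5 × IQR rule used to flag outliers is just one common convention.

```python
import statistics

# Hypothetical monthly incomes; None marks a missing answer (assumed coding).
incomes = [1200, 1500, None, 1800, 2100, None, 2500, 40000]

# Keep only the observed values for the summary statistics.
observed = [x for x in incomes if x is not None]

summary = {
    "n": len(incomes),
    "missing": len(incomes) - len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "stdev": statistics.stdev(observed),
    "min": min(observed),
    "max": max(observed),
}

# Flag suspicious values with the 1.5 * IQR rule (one convention among many).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [x for x in observed if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Three ways of handling missing values, mirroring the options in the text:
missing_flag = [x is None for x in incomes]            # separate category / indicator
median_income = statistics.median(observed)
imputed = [median_income if x is None else x for x in incomes]  # simple imputation
complete_cases = observed                               # drop incomplete observations

for name, value in summary.items():
    print(f"{name}: {value}")
print("flagged as outliers:", outliers)
```

Median imputation is used here only because it is easy to read; whether it is appropriate depends on why the values are missing.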
Income is a good example of a variable that should be examined first. It is often used as an explanatory variable in different models, from scoring models used by banks to quality of life and budget surveys in social statistics, and analyses performed by marketing departments. Income is quite a specific variable. Its distribution is right-skewed: in a typical sample, most subjects earn less than the average, but there is also a small group of people with very high incomes. As a result, the mean is pulled upwards and is heavily affected by the outliers. For example, a company named X employs 10 people: 8 employees receive monthly salaries of PLN 1,000 and the other 2 employees earn PLN 10,000 and PLN 15,000, respectively. In this example, the average salary is PLN 3,300, but this value does not reflect reality very well, because 8 out of 10 employees earn far less. For a right-skewed distribution, the median is typically lower than the mean. This is why the income of a typical earner is better described by the median (in the above example the median salary is PLN 1,000).
There is one more issue with income-related variables that must be handled. People are cautious about providing such information because they are afraid that it could be used against them or give rise to an investigation. They often declare a lower income and do not mention sources of income that might be considered illegal. This obstacle can be overcome to a certain extent by assigning respondents to pre-defined income ranges instead of asking them about their exact income. There is also the issue of missing data for income-related variables. Those with the highest incomes may be the most reluctant to provide information about them. So it is quite likely that the missing data are not random and that they carry information that should be considered in the model as well.
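The company X example can be checked directly. The snippet below is a small sketch that recomputes the mean and the median from the salaries given in the text and counts how many employees earn less than the mean; the variable names are chosen for illustration only.

```python
import statistics

# Salaries from the company X example: 8 employees at PLN 1,000,
# plus one at PLN 10,000 and one at PLN 15,000.
salaries = [1000] * 8 + [10000, 15000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# How many employees earn less than the "average" salary?
below_mean = sum(s < mean_salary for s in salaries)

print(f"mean:   PLN {mean_salary:,.0f}")
print(f"median: PLN {median_salary:,.0f}")
print(f"{below_mean} of {len(salaries)} employees earn less than the mean")
```

The two highest salaries account for most of the total, which is why the mean sits well above what 8 of the 10 employees actually earn.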
Therefore, when we examine an income-related variable we should check whether its distribution is right-skewed (and if not, whether this can be reasonably explained), analyse the summary statistics and select those that will be used in the analysis, and decide how to handle the missing data. For variables with a right-skewed distribution, it is good practice to transform them with an increasing concave function, that is, an increasing function with a negative second derivative. Logarithms, roots or the Box-Cox transformation can be used for this purpose. The demo "SAS Enterprise Miner: Impute, Transform, Regression & Neural Models" shows how this can be done in SAS Enterprise Miner.
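As a rough illustration of why such transformations help, the sketch below computes the moment-based skewness of a made-up income sample before and after a square-root-type and a logarithmic transformation. The `skewness` and `box_cox` helpers are written here for illustration and are not taken from any library; the Box-Cox formula used is the standard one for strictly positive values, with the parameter value 0 giving the natural logarithm.

```python
import math
import statistics


def skewness(values):
    """Population (moment-based) skewness; positive means a longer right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values; lam == 0 is the natural log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical right-skewed monthly incomes (illustrative data only).
incomes = [1200, 1500, 1800, 2000, 2300, 2600, 3000, 3800, 5500, 12000]

skew_raw = skewness(incomes)
skew_sqrt = skewness(box_cox(incomes, 0.5))  # square-root-type transform
skew_log = skewness(box_cox(incomes, 0))     # logarithm

print(f"skewness, raw:          {skew_raw:.2f}")
print(f"skewness, Box-Cox 0.5:  {skew_sqrt:.2f}")
print(f"skewness, log:          {skew_log:.2f}")
```

Both transformations require positive values, so zero or negative incomes would have to be handled separately before applying them.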
In the next post of the 10 Commandments of Applied Econometrics series, to be published soon, we will discuss two more important issues: keeping the solutions sensibly simple and using the interocular trauma test. Those interested are encouraged to read the original article by Peter E. Kennedy.