This week's SAS author's tip comes from Gerhard Svolba, author of Data Preparation for Analytics Using SAS and the upcoming book Data Quality for Analytics Using SAS. Gerhard is a product manager and pre-sales consultant at SAS in Austria. Gerhard wrote his first book on analytics about five years ago and it continues to perform well.
It's always such a treat catching up with Gerhard at SAS Global Forum or when he's visiting the SAS office here in Cary, NC. He's personable, funny, current, and shares entertaining stories about his three boys and life in Austria. If you, too, would like to get to know Gerhard and what he's currently working on, take a look at his sascommunity.org page.
The following excerpt is from SAS Press author Gerhard Svolba's book "Data Preparation for Analytics Using SAS" Copyright © 2006, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. (please note that results may vary depending on your version of SAS software)
11.4 Variable Criteria
When creating an analysis table from a data source or a number of data sources we potentially have a lot of variables in the source tables. Whether all these variables are transferred to the analysis table depends on the business question and the data structure.
For example, in the case of the creation of a one-row-per-subject analysis table, where we have data sources with multiple observations per analysis subject, we have to represent the repeated information in additional columns. We discussed this in Chapters 7 and 8.
The following four criteria characterize properties of the resulting derived variables that are created in an analysis table from the data source(s):
Sufficiency means that all potentially relevant information from the available source systems is also found in the analysis table.
Efficiency means that we try to keep the number of variables as small as possible. For example, in the case of transposing repeated observations, we can easily get thousands of variables in the analysis table.
Relevance means that data are aggregated in such a form that the derived variables are suitable for the analysis and business question. For example, in predictive modeling, we want to have input variables with high predictive power.
Interpretability means the variables that are used for analysis can be interpreted and are meaningful from a business point of view. In predictive modeling, however, variables that have high predictive power but are not easy to interpret are often used in the analysis. In this case, other variables are used for model interpretation.
For customers we have a measurement history for the account balance over 20 months. When we have to condense this information into a one-row-per-customer data mart, we are faced with the challenge of creating meaningful variables from the measurement history.
Simply creating 20 variables M1–M20 for the measurements would fulfill the sufficiency requirement. However, creating these variables would not be efficient, nor would it usually provide relevant variables for analysis.
In this case data preparation has to provide meaningful derived values of the measurements over time, which does not necessarily mean that we consider each value itself. Chapter 18 – Multiple Interval-Scaled Observations per Subject will go into detail about this topic and discuss the applications of indicators, trend variables, and moving averages, among others.
Visit Gerhard Svolba's author page to read a free chapter from Data Preparation for Analytics Using SAS, as well as learn more about his upcoming book Data Quality for Analytics Using SAS!