This week's SAS tip is from Randy Collica and his book Customer Segmentation and Clustering Using SAS Enterprise Miner, Second Edition. Randy, a Senior Solutions Architect for SAS, is extremely knowledgeable. His current interests include clustering and ensemble models, knowledge and data engineering, missing data and imputation, and text mining techniques for use in business and customer intelligence. You can benefit from his knowledge this week by taking a look at this free excerpt from his latest book.
The following excerpt is from SAS Press author Randy Collica's book "Customer Segmentation and Clustering Using SAS Enterprise Miner, Second Edition." Copyright © 2011, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. (Please note that results may vary depending on your version of SAS software.)
Profiling of Customers and Prospects
One of the main issues in any segmentation is the profile of the segment in question. The profile of a segment is the basic description of the common elements that the customers or prospects in that segment share. Comparing and contrasting segment profiles gives a better understanding of the set of customers represented in each segment.
So how does one go about profiling a set of customers? The answer starts with a data assay. The Oxford English Dictionary defines assay as “the trying in order to test the virtue, fitness, etc. (of a person or thing).” This is what we want to do with data: the data assay produces detailed knowledge, usually in the form of a report on the quality, problems, shortcomings, and suitability of the data for mining (Pyle 1999, p. 125). A data assay typically starts with some basic characteristics, such as the number of unique values for each categorical variable, the percentage of missing values, and the mean, standard deviation, and outliers for numeric variables. Tabulated in a report, this kind of summary allows one to survey the variables quickly and can give the analyst clues about how to approach certain kinds of data mining. This report can form the foundation for all preparation and mining work that follows.
For example, if two categorical variables or columns in a data set each have 10% missing values, then when the two variables are used in combination, the proportion of records with at least one missing value can be anywhere from 10% to 20%, depending on how much the missing values in the two fields overlap. This kind of problem can have a large negative effect on the outcome of data mining or statistical analyses, and one needs to be aware that it occurs frequently in typical data sets. So, let’s begin working with an example to demonstrate a preliminary data assay and start a profile exercise.
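To make the idea of a data assay concrete before turning to the exercise, here is a minimal sketch in Python rather than SAS Enterprise Miner. It is not the book's method. The function names, the sample columns, and the outlier rule (values more than three standard deviations from the mean) are all assumptions chosen for illustration. It computes the basic assay statistics described above and shows how two columns that are each 10% missing can combine to anywhere between 10% and 20% missing.

```python
from statistics import mean, stdev


def assay_categorical(values):
    """Count unique values and percent missing; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n_unique": len(set(present)),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
    }


def assay_numeric(values, z_cutoff=3.0):
    """Mean, standard deviation, percent missing, and simple outlier flags.

    Assumption: an outlier is any value more than z_cutoff sample standard
    deviations from the mean. The excerpt does not prescribe a rule.
    """
    present = [v for v in values if v is not None]
    mu = mean(present)
    sd = stdev(present)  # sample standard deviation; needs at least 2 values
    return {
        "mean": mu,
        "std": sd,
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "outliers": [v for v in present if abs(v - mu) > z_cutoff * sd],
    }


def pct_missing_combined(col_a, col_b):
    """Percent of rows in which at least one of the two columns is missing."""
    rows = list(zip(col_a, col_b))
    missing = sum(1 for a, b in rows if a is None or b is None)
    return 100.0 * missing / len(rows)


# Hypothetical columns: each has exactly 1 missing value out of 10 rows (10%).
region = ["East", None, "West", "East", "West",
          "East", "East", "West", "East", "West"]
# Missing value falls in a different row than the one in `region`.
industry_disjoint = ["Tech", "Retail", None, "Tech", "Retail",
                     "Tech", "Tech", "Retail", "Tech", "Retail"]
# Missing value falls in the same row as the one in `region`.
industry_overlap = ["Tech", None, "Retail", "Tech", "Retail",
                    "Tech", "Tech", "Retail", "Tech", "Retail"]

print(assay_categorical(region))
print(pct_missing_combined(region, industry_disjoint))  # no overlap
print(pct_missing_combined(region, industry_overlap))   # full overlap
```

When the missing rows do not overlap, the combined missing rate is the sum of the two individual rates; when they coincide, it stays at the individual rate. Real data usually falls somewhere in between, which is why the assay should report missingness for combinations of variables as well as for each variable alone.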
In all of the exercises in this book, I will outline a brief process flow table like the following one. The process flow table indicates the major steps that are to be taken to complete the data mining exercise.