Help! I have too many variables for a cube!


The following email landed in my inbox this morning, and it is such a good question that I decided to share it here, because it comes up often. (If you are the sender, thank you for sending this message and pulling me out of a no-blogging-recently slump!)

Dear Cat,
I need some help and I really don't have a clue where to start. I was introduced to cubes in a SAS programming class, and they are really useful for slicing and dicing my data to understand how my key metrics change with critical variables. But my question is this: How do I decide which critical variables I should use? Specifically, for the cube I want to determine which variables to include in a hierarchy and the order in which to place them. In my work there is a lot of data available, indeed too much. How can I sort through the different types of data to focus on the variables that really make a difference, and then use these to build the cube?
-Confused**3

Dear Confused**3,

What you have explained, better than I ever could, is the justification for using statistical models. While it's true that cubes are useful for exploring multidimensional relationships between business variables and your target metrics, once you get past 2 or 3 dimensions, cubes become unwieldy. Then you need something to prioritize the business variables. Even then, sometimes there are too many variables, and most variable reduction techniques only look at one pair of variables at a time, so they completely miss interaction effects, such as the demand for seasonal items depending on the time of year. Unless you know a priori to include those variables in the cube, you miss the critical relationships.
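A statistical model, by contrast, can test an interaction directly. Here is a minimal sketch, assuming a hypothetical data set work.sales with variables revenue, advertising, and season (these names are stand-ins, not from the question above):

proc glm data=work.sales;
   class season;
   /* The season*advertising term lets the advertising effect differ by season,
      exactly the kind of relationship a one-pair-at-a-time screen never surfaces. */
   model revenue = season advertising season*advertising / solution;
run;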

Another problem with cubes lies in binning your continuous variables. Binning can help reveal nonlinear associations in the data, but you lose the simplicity of interpretations such as, "What is the expected increase in revenue for a $10 increase in advertising expenditure?" You also have to pick (usually) arbitrary split points for grouping, instead of optimizing the split points based on how the data relate to the target metric. For example, if you split a temperature variable that ranges from -10 to 90 F into 3 equal-width bins, you will see mean ice cream sales for temperatures below 23 degrees, from 23 to 57 degrees, and above 57 degrees. But common sense and many decades of eating ice cream tell me that the critical split point is around 80 degrees. Any colder and the kids are buying other kinds of treats.
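A decision tree will find that split point for you instead of making you guess. A minimal sketch, assuming a hypothetical data set work.icecream with daily sales and the temperature for each day (PROC HPSPLIT is available in recent releases of SAS/STAT):

/* A regression tree searches over candidate temperature values and keeps the
   split that best separates mean sales, rather than using preset equal-width bins. */
proc hpsplit data=work.icecream maxdepth=2;
   model sales = temperature;
run;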

So what do you do?

I'm glad you asked.

Regression models, neural networks, decision trees, discriminant function models, and many other statistical modeling techniques are suited to these kinds of problems. Some models, such as regression, have built-in solutions to the too-many-variables problem via stepwise or all-subsets selection. Other models, such as decision trees, deal with the problem by ignoring irrelevant variables. Trees also optimize the split points of continuous variables, as in the temperature and ice cream example above. All of them are complemented by dimension reduction techniques such as variable clustering. Some of these models are fitted through procedures in SAS/STAT, while others require more specialized software such as SAS Enterprise Miner or JMP. If there is a time dimension in the data, then you really want to pull out the statistical big guns and consider the kinds of models that are used to forecast time series.
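To make a couple of those ideas concrete, here is a sketch of stepwise selection and variable clustering in SAS/STAT. The data set work.metrics, the target revenue, and the inputs x1-x50 are hypothetical stand-ins for whatever your business data look like:

/* Stepwise selection keeps only the inputs that earn their way into the model. */
proc glmselect data=work.metrics;
   model revenue = x1-x50 / selection=stepwise(select=sl slentry=0.05 slstay=0.05);
run;

/* Variable clustering groups correlated inputs so you can carry forward
   one representative variable per cluster. */
proc varclus data=work.metrics maxeigen=0.7 short;
   var x1-x50;
run;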

The primary advantages of statistical modeling techniques over cubes are that they can work with a large number of inputs, find complex higher-order relationships (two-way interactions are just the beginning), and leave you with a list of the business variables that make the most difference. Best of all, a predictive model can be used to score new data without refitting the model or recreating the cube. Diagnostics make it possible to determine whether your population has shifted and the model needs to be re-estimated. Most of these models are easy to construct and easier to score, and the results can be presented to a non-technical audience with simple graphics.
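Scoring without refitting can be as simple as saving the fitted model and applying it later. A sketch, again with hypothetical data set and model names:

/* Fit once and save the selected model as an item store. */
proc glmselect data=work.metrics;
   model revenue = x1-x50 / selection=stepwise;
   store work.revenue_model;
run;

/* Later, score a new batch of data without refitting anything. */
proc plm restore=work.revenue_model;
   score data=work.new_batch out=work.scored;
run;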

One caution about using simple methods to describe complex systems: ignoring important dimensions can result in an incorrect interpretation of the data. Take, for example, Simpson's Paradox, which occurs when the relationship between variables within groups differs from, and sometimes even reverses, the relationship across groups. For example, within a country, auto fatalities are negatively associated with income. However, across countries, greater wealth is associated with more auto fatalities. The classic error is to draw a (causal) inference from either of these facts on its own. Analyzing all the relevant variables together leads the analyst to a more meaningful conclusion.
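One way to see whether you are in Simpson's Paradox territory is to fit the relationship with and without the grouping variable and compare the sign of the estimated effect. A sketch with hypothetical names (work.fatalities, with one row per region carrying a fatality rate, average income, and country):

/* Aggregate relationship, ignoring country. */
proc reg data=work.fatalities;
   model fatal_rate = income;
run;
quit;

/* Within-country relationship: the income effect adjusted for country. */
proc glm data=work.fatalities;
   class country;
   model fatal_rate = income country / solution;
run;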

So the next time you're sitting down to an afternoon of cube-making, ask yourself: do I really want to go to that much trouble, when a statistical model will be faster, easier, have a longer shelf-life, and give a more realistic picture of the truth? And then spend the rest of your afternoon eating ice cream.

If you would like to learn more about these kinds of techniques, check out some of these classes. And if you have a cool story about replacing simple descriptive reports with statistical models, leave a comment.


About Author

Catherine (Cat) Truxillo

Director of Analytical Education, SAS

Catherine Truxillo, Ph.D. has written or co-written SAS training courses for advanced statistical methods, including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. She also teaches courses on leadership and communication in data science.

