My oldest son is in the school band, and they are getting ready for their spring concert. Their fall concert was wonderful; hearing dozens of students playing their own instruments together creates beautiful, rich-sounding music. The depth of sound from orchestral or symphonic music is unmatched.

In data mining, and specifically in predictive modeling, a similar effect can be created with ensembles of models, which can lead to results that are more “beautiful” than those of a single model. A predictive model ensemble combines the posterior predictions from more than one model. When you combine multiple models, you create a kind of model crowdsourcing. Each individual model is described by a set of rules, and when the rules are applied in concert you can consider the "opinions" of many models. How to use these opinionated models depends on the goal. The two main ways are to (1) let every model vote and decide the target label democratically or (2) label the target with the opinion of the most confident model (probabilistically speaking).
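To make the two ways of combining opinions concrete, here is a minimal Python sketch. The three sets of posterior probabilities and the label names are made up for illustration; they are not output from any real model.

```python
from collections import Counter

# Hypothetical posterior probabilities from three models for one observation.
# Each dict maps a candidate label to that model's predicted probability.
model_posteriors = [
    {"black": 0.55, "non-black": 0.45},
    {"black": 0.60, "non-black": 0.40},
    {"black": 0.05, "non-black": 0.95},
]

def majority_vote(posteriors):
    """Each model votes for its most probable label; the most common vote wins."""
    votes = [max(p, key=p.get) for p in posteriors]
    return Counter(votes).most_common(1)[0][0]

def most_confident(posteriors):
    """Use the label from the single model with the highest posterior probability."""
    best = max(posteriors, key=lambda p: max(p.values()))
    return max(best, key=best.get)

vote_label = majority_vote(model_posteriors)
confident_label = most_confident(model_posteriors)
print(vote_label, confident_label)
```

In this made-up case the two rules disagree: two lukewarm models outvote one very confident model, which is why the choice between them depends on the goal.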

Types of Ensembles

The two main forms of ensembles are bagging (short for bootstrap aggregating) and boosting. The most popular ensembles are built from decision trees; random forests and gradient boosting machines are two examples that are very popular in the data mining community right now. While decision trees are the most popular base model, they are not the only option. Any modeling algorithm can be part of an ensemble, and heterogeneous ensembles can be quite powerful.

Bagging

Bagging, as the name suggests, takes repeated unweighted samples of the data with replacement, builds a model on each sample, and then combines the models. Think of your observations as grains of wild rice in a bag. Your objective is to identify the black grains because they have a resale price 10x greater when sold separately.

1. Take a scoop of rice from the bag.
2. Use your scoop of rice to build a model based on the grains' characteristics, excluding color.
3. Write down your model classification logic and fit statistics.
4. Pour the scoop of rice back into the bag.
5. Shake the bag for good measure and repeat.
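The scoop-and-repeat routine can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the synthetic grain lengths, the one-rule "stump" model, the scoop fraction of 0.3, and the nine scoops. It is a toy version of bagging, not a production implementation.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic "bag of rice": (length_mm, is_black). Hypothetical data in which
# black grains tend to be longer; color itself is never used as an input.
bag = [(random.gauss(7.0, 0.5), True) for _ in range(100)] + \
      [(random.gauss(5.0, 0.5), False) for _ in range(400)]

def fit_stump(sample):
    """Pick the length threshold with the fewest misclassifications in the sample."""
    best_threshold, best_errors = None, len(sample) + 1
    for threshold, _ in sample:
        errors = sum((length >= threshold) != is_black for length, is_black in sample)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(data, n_scoops=9, scoop_fraction=0.3):
    """Fit one stump per scoop, each scoop drawn with replacement from the bag."""
    scoop_size = int(len(data) * scoop_fraction)
    return [fit_stump(random.choices(data, k=scoop_size)) for _ in range(n_scoops)]

def predict(thresholds, length):
    """Majority vote: the grain is called black if most stumps say so."""
    votes = sum(length >= t for t in thresholds)
    return votes > len(thresholds) / 2

thresholds = bagged_stumps(bag)
accuracy = sum(predict(thresholds, length) == is_black for length, is_black in bag) / len(bag)
print(f"{len(thresholds)} stumps, training accuracy {accuracy:.2f}")
```

An odd number of scoops is used here only so that the majority vote cannot tie.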

How big the scoop is relative to the bag, and how many scoops you take, will vary by industry and situation, but I usually use 25-30% of my data and take 7-10 samples. With those settings, each observation can be expected to appear in roughly two or three of the samples on average, although any single observation may be drawn more often or not at all.
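A quick back-of-the-envelope check of how often an observation gets sampled, as a small Python sketch. It assumes each scoop is an independent simple random sample that contains the stated fraction of the bag, so a given grain lands in any one scoop with probability equal to that fraction.

```python
def expected_inclusions(scoop_fraction, n_scoops):
    """Average number of scoops that contain a given observation."""
    return scoop_fraction * n_scoops

def prob_never_drawn(scoop_fraction, n_scoops):
    """Chance that a given observation is missed by every scoop."""
    return (1 - scoop_fraction) ** n_scoops

# The low and high ends of the settings mentioned above.
for fraction, scoops in [(0.25, 7), (0.30, 10)]:
    print(fraction, scoops,
          round(expected_inclusions(fraction, scoops), 2),
          round(prob_never_drawn(fraction, scoops), 3))
```

Under that assumption, the smaller setting leaves a noticeable share of observations unused, which is one reason to lean toward more scoops when the data set is small.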

Boosting

Boosting is similar to bagging except that the observations in the samples are now weighted. To follow the rice problem from above, after step 3 I would take the grains of rice I had incorrectly classified (e.g., black grains I said were non-black or non-black grains I thought were black) and set them aside. I would then take a scoop of rice from the bag and leave some room to add the grains I had incorrectly classified. By including previously misclassified grains at a higher rate, the algorithm has more opportunities to identify the characteristics that lead to correct classifications. This is the same idea behind spending more time reviewing flashcards of facts you didn’t know than those you did.

For what it's worth, I tend to use bagging models for prediction problems and boosting for classification problems. By taking multiple samples of the data and modeling over iterations, you allow factors that are otherwise weak to be explored. This provides a more stable and generalizable solution. When model accuracy is the most important consideration, ensemble models will be your best bet.

This topic was recently discussed in much greater detail at SAS Global Forum. See this paper by Miguel Maldonado for more details.
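Returning to the boosting idea described above, here is a minimal Python sketch of one well-known variant, AdaBoost with one-rule "stumps". Instead of physically adding misclassified grains back into the scoop, it raises their weights, which has the same effect of making them count for more in the next round. The synthetic grain measurements, the two features, and the five rounds are all assumptions for illustration, and this is not necessarily the procedure any particular software package uses.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical grains: (length_mm, width_mm) features and a label of
# +1 for black or -1 for non-black. Color is never used as an input.
grains = [((random.gauss(7.0, 0.6), random.gauss(1.8, 0.3)), 1) for _ in range(60)] + \
         [((random.gauss(5.5, 0.6), random.gauss(2.4, 0.3)), -1) for _ in range(140)]

def stump_predict(stump, features):
    """Predict `sign` when the chosen feature is at or above the threshold."""
    feature_index, threshold, sign = stump
    return sign if features[feature_index] >= threshold else -sign

def fit_weighted_stump(data, weights):
    """Return the single-feature threshold rule with the lowest weighted error."""
    best_stump, best_error = None, float("inf")
    for feature_index in range(2):
        for features, _ in data:
            for sign in (1, -1):
                stump = (feature_index, features[feature_index], sign)
                error = sum(w for (x, y), w in zip(data, weights)
                            if stump_predict(stump, x) != y)
                if error < best_error:
                    best_stump, best_error = stump, error
    return best_stump, best_error

def boost(data, n_rounds=5):
    """Fit stumps in sequence, reweighting the observations after each round."""
    weights = [1 / len(data)] * len(data)
    ensemble = []
    for _ in range(n_rounds):
        stump, error = fit_weighted_stump(data, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's say in the vote
        # Misclassified grains get heavier, correctly classified ones lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, features):
    """Weighted vote of all stumps; a non-negative score means black (+1)."""
    score = sum(alpha * stump_predict(stump, features) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

ensemble = boost(grains)
accuracy = sum(predict(ensemble, x) == y for x, y in grains) / len(grains)
print(f"{len(ensemble)} weighted stumps, training accuracy {accuracy:.2f}")
```

The reweighting line is the flashcard idea in code: a grain the current stump got wrong has its weight multiplied by a factor greater than one, so the next stump pays more attention to it.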

Image credit: photo by Ludovico Sinz // attribution by creative commons