This is the second post in my series of machine learning best practices. If you missed it, read the first post, Machine learning best practices: the basics. As we go along, all ten tips will be archived at this machine learning best practices page.
Machine learning commonly requires the use of highly unbalanced data. When detecting fraud or isolating manufacturing defects, for example, the target event is extremely rare – often way below 1 percent. So, even if you’re using a model that’s 99 percent accurate, it might not correctly classify these rare events.
What can you do to find the needles in the haystack?
A lot of data scientists frown when they hear the word sampling. I like to use the term focused data selection, where you construct a biased training data set by oversampling or undersampling. As a result, my training data may end up slightly more balanced, often with a 10 percent event level or more (See Figure 1). This higher ratio of events can help the machine learning algorithm learn to better isolate the event signal.
For reference, undersampling removes observations at random to downsize the majority class. Oversampling up-sizes the minority class at random to decrease the level of class disparity.
Another rare event modeling strategy is to use decision processing to place greater weight on correctly classifying the event.
The table above shows the cost associated with each decision outcome. In this scenario, classifying a fraudulent case as not fraudulent has an expected cost of $500. There's also a $100 cost associated with falsely classifying a non-fraudulent case as fraudulent.
Rather than developing a model based on some statistical assessment criterion, here the goal is to select the best model that minimizes total cost. Total cost = False negative X 500 + False Positive X 100. In this strategy, accurately specifying the cost of the two types of misclassification is the key to the success of the algorithm.
If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here.