Who Ate My Lunch? Discriminant Thresholds to Reduce False Accusations

Lunch. For some workers, it’s the sweetest part of an otherwise bitter day at the grindstone. Nothing can turn that sweetness sour like going into the breakroom to discover that someone has taken your lunch and eaten it themselves.

Nothing like that ever happens here at SAS.

But if it did, I would set up a system to repeatedly collect and identify the saliva of the top suspects, and do an elegant chemical analysis. When a lunch goes missing, there’s always some residual spit on the used container.

I could develop a discriminant analysis model to identify each suspect. Then I’d score newly missing lunches with the model, flag the culprit, track them down and make them buy a box of Belgian chocolates for the person whose lunch they pilfered.

But what if I falsely accused someone who was innocent? Oh gosh. That could be an embarrassing and expensive error.

Let’s review how the discriminant analysis would look:

As you can see, I have specified custom priors. My training data are balanced (100 spit samples per suspect), so priors taken from the training data would treat every suspect as equally likely. But everyone in the office knows that Jeff* is the least likely to pilfer and Heath the most, and I want the classification to account for this. The priors I specified reflect the marginal probability that the lunch was taken by each colleague, independent of the saliva analysis.
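To see how priors pull a classification one way or the other, here is a minimal Python sketch of Bayes' rule as discriminant analysis applies it: each suspect's posterior probability is proportional to the prior times the density of the sample under that suspect's group. This is an illustration of the idea, not the SAS code behind the analysis, and the prior and density values are made up for the example.

```python
def posteriors(priors, densities):
    """Posterior for each group: prior * density, normalized to sum to 1."""
    weighted = {g: priors[g] * densities[g] for g in priors}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}

# Hypothetical group densities for one spit sample (made-up numbers):
# on chemistry alone, the sample looks most like Jeff.
densities = {"Jeff": 0.5, "Elizabeth": 0.2, "Heath": 0.3}

# Equal priors, as balanced training data would suggest.
flat_priors = {"Jeff": 1 / 3, "Elizabeth": 1 / 3, "Heath": 1 / 3}
# Hypothetical custom priors (assumed values, not the ones from the analysis).
office_priors = {"Jeff": 0.1, "Elizabeth": 0.3, "Heath": 0.6}

flat_post = posteriors(flat_priors, densities)
post = posteriors(office_priors, densities)

print(max(flat_post, key=flat_post.get))  # most probable suspect under equal priors
print(max(post, key=post.get))            # most probable suspect under custom priors
```

With these made-up numbers, the same sample points to Jeff under equal priors and to Heath under the custom priors, which is exactly the kind of shift the priors are meant to produce.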

The inputs, logx1 through logx4, are the chemical analysis variables. They are super-secret and proprietary. Let’s just say… I know who drinks coffee with nondairy whitener and who prefers tea with lunch.

The classification summary looks like this:

I did a great job of classifying! But let’s look at the kind of mistakes I’m making as well…

(this is just a sampling of the output)
Each one of those “*” is a misclassification. Each one is a false accusation against an innocent colleague. And look, for example, at observation #64: it’s almost a coin-toss as to whether it looks more like Jeff or Elizabeth. I need to be more certain than that if Belgian chocolates are involved.

It would be better to let some of the more ambiguous cases go by the wayside. Enter the THRESHOLD= option. Surprisingly few people I encounter are aware of this option; most instead use a DATA step to recode the classifications based on the posterior probabilities. Here’s how it works:

THRESHOLD=0.8 means that an observation is only classified into a group if its largest posterior probability is at least 0.8. All other observations are classified as “Other.” Let’s revise the analysis and see what happens:
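The rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the thresholding logic, not what PROC DISCRIM runs internally; the posterior probabilities are hypothetical, and treating a posterior exactly equal to the threshold as classified is an assumption of the sketch.

```python
def classify(posterior, threshold=0.8):
    """Return the most probable group, or "Other" if its posterior is below threshold."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else "Other"

# Hypothetical posterior probabilities for two scored lunches (made-up values).
ambiguous = {"Jeff": 0.51, "Elizabeth": 0.48, "Heath": 0.01}  # near coin-toss
clear_cut = {"Jeff": 0.02, "Elizabeth": 0.03, "Heath": 0.95}

print(classify(ambiguous))  # Other
print(classify(clear_cut))  # Heath
```

The near coin-toss case is left unassigned instead of being pinned on Jeff, while the clear-cut case is still flagged.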

There are lots of “Other” but very few false accusations. Any that are falsely accused, well, they probably stole someone else’s coffee that morning.

Office espionage aside, this cautious approach to scoring is really useful in the context of a classification scheme where false positives are extremely costly. Now, go forth and eat (your own!) lunch.

*The names in this blog are entirely fictitious, and any similarity to good friends at work who I eat lunch with every day is purely coincidental.

About Author

Catherine (Cat) Truxillo

Director of Analytical Education, SAS

Catherine Truxillo, Ph.D. has written or co-written SAS training courses for advanced statistical methods, including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. She also teaches courses on leadership and communication in data science.

1 Comment

  1. you've just inspired a non-statistician to want to play with proc discrim with your easy-to-understand lunch example. thanks!
