Who Ate My Lunch? Discriminant Thresholds to Reduce False Accusations

Lunch. For some workers, it’s the sweetest part of an otherwise bitter day at the grindstone. Nothing can turn that sweetness sour like going into the breakroom to discover that someone has taken your lunch and eaten it themselves.

Nothing like that ever happens here at SAS.

But if it did, I would set up a system to repeatedly collect and identify the saliva of the top suspects, and do an elegant chemical analysis. When a lunch goes missing, there’s always some residual spit on the used container.

I could develop a discriminant analysis model to identify each suspect. Then I’d score newly missing lunches with the model, flag the culprit, track them down and make them buy a box of Belgian chocolates for the person whose lunch they pilfered.

But what if I falsely accused someone who was innocent? Oh gosh. That could be an embarrassing and expensive error.

Let’s review how the discriminant analysis would look:

Note that I have specified custom priors. My training data are balanced (I have 100 spit samples per suspect), so the sample proportions say nothing about who is actually likely to pilfer. But everyone in the office knows that Jeff* is the least likely to pilfer and Heath the most, and I want the classification to account for this. The priors I specified reflect the marginal probability that the lunch was taken by each colleague, independent of the saliva analysis.
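To make the role of the priors concrete, here is a minimal sketch in Python of how a prior reweights the evidence from the chemistry. This is not PROC DISCRIM's code, just Bayes' rule written out. The function name `posteriors`, the density values, and the prior values are all made up for illustration; they are not the numbers from my analysis.

```python
def posteriors(densities, priors):
    """Combine group-specific densities with priors using Bayes' rule.

    densities: how well the saliva chemistry fits each suspect (hypothetical values)
    priors:    marginal probability that each suspect took the lunch
    """
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    # Weight each suspect's density by that suspect's prior ...
    weighted = {name: priors[name] * densities[name] for name in densities}
    # ... then rescale so the posterior probabilities sum to 1.
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


# Hypothetical evidence: the chemistry fits Jeff and Heath equally well.
densities = {"Jeff": 0.30, "Heath": 0.30, "Elizabeth": 0.10}

equal_priors = {"Jeff": 1 / 3, "Heath": 1 / 3, "Elizabeth": 1 / 3}
custom_priors = {"Jeff": 0.1, "Heath": 0.6, "Elizabeth": 0.3}  # assumed values

with_equal = posteriors(densities, equal_priors)
with_custom = posteriors(densities, custom_priors)

print(with_equal)   # Jeff and Heath are tied
print(with_custom)  # the prior tips the balance toward Heath
```

With equal priors the sample is a dead heat between Jeff and Heath. With the assumed custom priors, the same chemistry points much more strongly at Heath, which is exactly the office knowledge I wanted the model to use.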

The inputs, logx1 through logx4, are the chemical analysis variables. They are super-secret and proprietary. Let’s just say… I know who drinks coffee with nondairy whitener and who prefers tea with lunch.

The classification summary looks like this:

I did a great job of classifying! But let’s look at the kind of mistakes I’m making as well…

(this is just a sampling of the output)
Each one of those “*” is a misclassification. Each one is a false accusation against an innocent colleague. And look, for example, at observation #64: it’s almost a coin-toss as to whether it looks more like Jeff or Elizabeth. I need to be more certain than that if Belgian chocolates are involved.

It would be better to let some of the more ambiguous cases go by the wayside. Enter the THRESHOLD= option. Surprisingly, not many people I encounter are aware of this option; instead, they use a DATA step to recode the classifications based on the posterior probabilities. Here’s how it works:

THRESHOLD=0.8 means that an observation is classified into a group only if its largest posterior probability is at least 0.8. All other observations are classified as “Other.” Let’s revise the analysis and see what happens:
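The rule itself is simple enough to sketch in a few lines of Python. This is an illustration of the logic only, not what PROC DISCRIM runs internally. The function name `classify` is made up, and so are the posterior probabilities, which merely imitate a near coin-toss case.

```python
def classify(posterior, threshold=0.0):
    """Return the most probable group, or "Other" if it is not probable enough.

    posterior: dict mapping each suspect to a posterior probability
    threshold: minimum acceptable posterior probability for an accusation
    """
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else "Other"


# Hypothetical posteriors for an ambiguous lunch container.
coin_toss = {"Jeff": 0.52, "Elizabeth": 0.47, "Heath": 0.01}
# Hypothetical posteriors for a clear-cut case.
clear_cut = {"Jeff": 0.02, "Elizabeth": 0.03, "Heath": 0.95}

print(classify(coin_toss))                 # Jeff
print(classify(coin_toss, threshold=0.8))  # Other
print(classify(clear_cut, threshold=0.8))  # Heath
```

With no threshold, the ambiguous case still gets pinned on Jeff. With a threshold of 0.8, nobody is accused unless the evidence is strong.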

There are lots of “Other” but very few false accusations. Any that are falsely accused, well, they probably stole someone else’s coffee that morning.
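The trade-off can be tallied with a small Python sketch. Everything here is hypothetical: the five scored lunches, the true culprits, and the helper names `classify` and `tally` are invented to show the bookkeeping, and the counts say nothing about my actual results.

```python
from collections import Counter


def classify(posterior, threshold):
    """Most probable suspect, or "Other" if the top posterior is below threshold."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else "Other"


def tally(scored, threshold):
    """Count correct calls, false accusations, and "Other" at a given threshold."""
    counts = Counter()
    for truth, posterior in scored:
        label = classify(posterior, threshold)
        if label == "Other":
            counts["other"] += 1
        elif label == truth:
            counts["correct"] += 1
        else:
            counts["false_accusation"] += 1
    return counts


# Hypothetical scored lunches: (true culprit, posterior probabilities).
scored = [
    ("Heath", {"Jeff": 0.02, "Heath": 0.95, "Elizabeth": 0.03}),
    ("Jeff", {"Jeff": 0.90, "Heath": 0.04, "Elizabeth": 0.06}),
    ("Elizabeth", {"Jeff": 0.52, "Heath": 0.01, "Elizabeth": 0.47}),
    ("Elizabeth", {"Jeff": 0.10, "Heath": 0.25, "Elizabeth": 0.65}),
    ("Heath", {"Jeff": 0.05, "Heath": 0.35, "Elizabeth": 0.60}),
]

no_threshold = tally(scored, 0.0)
strict = tally(scored, 0.8)

print(dict(no_threshold))  # every lunch gets a verdict, right or wrong
print(dict(strict))        # fewer verdicts, but no false accusations here
```

In this toy data, raising the threshold removes both false accusations, at the price of one correct call that is now filed under “Other.”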

Office espionage aside, this cautious approach to scoring is really useful in the context of a classification scheme where false positives are extremely costly. Now, go forth and eat (your own!) lunch.

*The names in this blog are entirely fictitious, and any similarity to good friends at work who I eat lunch with every day is purely coincidental.

About Author

Catherine Truxillo

Catherine Truxillo, Ph.D. has been a Statistical Training Specialist at SAS since 2000 and has written or co-written SAS training courses for advanced statistical methods including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. Although she primarily works with advanced statistics topics, she also teaches SAS courses using SAS/IML (the interactive matrix language), SAS Enterprise Guide, SAS Enterprise Miner, SAS Forecast Studio, and JMP software. Before coming to SAS, Catherine completed her Ph.D. in Social Psychology with an emphasis in Statistics at The University of Texas at Austin. While at UT Austin, she completed an internship with the Math and Computer Science department's statistical consulting help desk and taught a number of undergraduate courses. While teaching and performing her own graduate research, she worked for a software usability design company conducting experiments to assess the ease-of-use of various software interfaces and website designs. Cat's personal interests include triathlon, hiking the woods near her home in North Carolina, and having tea parties with her two children.

1 Comment

  1. you've just inspired a non-statistician to want to play with proc discrim with your easy-to-understand lunch example. thanks!
