Imagine you're an investigator sorting through 6,000 mug shots taken at different jails around the state. You are tasked with finding all the mug shots of a high-profile con artist who has been arrested in multiple cities using a number of different aliases over the last five years.
How can you use math to help? With principal component analysis.
Principal component analysis is a statistical technique that combines many correlated variables into a smaller number of components, reducing the amount of data that needs to be analyzed.
Using the mug shots in our example, the process identifies a selection of correlated measurements, maybe the distance between the eyes, the width of the nose and the height of the ears. It then combines those measurements into a reduced number of newly created features – the principal components – that best represent the variation in the images. Finally, the computer compares your suspect’s unique features to the principal components and quickly finds five mug shots that match.
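The mug shot search can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the method from the webinar: the facial measurements are synthetic, the helper names (`fit_pca`, `project`, `closest_matches`) are hypothetical, and ranking matches by Euclidean distance in component space is one plausible choice among several.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions,
    # ordered from most to least variance explained.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Express each row of X as scores on the principal components."""
    return (X - mean) @ components.T

def closest_matches(X, query, mean, components, k=5):
    """Indices of the k rows of X nearest to query in component space."""
    scores = project(X, mean, components)
    q = project(query[None, :], mean, components)[0]
    distances = np.linalg.norm(scores - q, axis=1)
    return np.argsort(distances)[:k]

# Assumption: synthetic stand-in for 6,000 mug shots, each described by
# 20 correlated measurements driven by 3 underlying factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(6000, 3))
mixing = rng.normal(size=(3, 20))
measurements = latent @ mixing + 0.05 * rng.normal(size=(6000, 20))

mean, components = fit_pca(measurements, n_components=3)
suspect = measurements[42]  # hypothetical reference photo of the suspect
matches = closest_matches(measurements, suspect, mean, components, k=5)
print(matches)
```

Here each photo is compared on 3 component scores rather than 20 raw measurements, which is where the savings in time and storage come from.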
This simple example illustrates a complex process that is used in many industries to solve problems like:
- Fraud detection.
- Word recognition.
- Speech recognition.
- Spam detection.
“A typical data mining problem would look at hundreds or thousands of features,” explains Research Statistician Developer Funda Gunes in a recent webinar about principal component analysis. “Many features are correlated, though, which leads to redundancy in the data.”
Principal component analysis reduces the redundancy to make it easier for computers to compare variables and look for patterns. This is important not only for speed but also for precision, explains Gunes. “Redundancy in data causes overfitted models and reduces prediction accuracy.”
Other benefits include reducing the amount of memory and disk space needed for storing data, identifying hidden structures, detecting outliers and allowing for visualization of data sets with many variables.
Speaking mathematically, Gunes explains that principal components are linear combinations of the original variables that reduce the dimensionality of a data set. “Principal component analysis converts a set of possibly correlated features into a set of linearly uncorrelated features called principal components,” she says.
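A small numerical check can make that statement concrete. The sketch below is illustrative only and assumes synthetic data with two deliberately correlated features. It computes principal components from the eigenvectors of the covariance matrix (one standard route, not necessarily the one used in the webinar) and confirms that the resulting scores are uncorrelated.

```python
import numpy as np

# Assumption: synthetic data, 500 observations of 3 features,
# where x2 is almost a copy of x1 (redundant information).
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Center the data and compute the feature covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions;
# eigh returns them in ascending order, so sort by descending variance.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each principal component is a linear combination of the original features.
scores = Xc @ eigenvectors

# The covariance of the scores is diagonal: the components are uncorrelated.
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))

# Share of total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

print("correlation of x1 and x2:", np.corrcoef(x1, x2)[0, 1])
print("largest off-diagonal covariance of scores:", np.abs(off_diagonal).max())
print("explained variance ratio:", explained)
```

Because `x2` adds little beyond `x1`, the last component carries only a small share of the variance, so dropping it reduces three features to two with little loss of information.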
To learn more about the math behind principal component analysis, and how to identify the first and second principal components, watch the webinar Principal Component Analysis for Machine Learning.