In a previous article, I discussed random jittering as a technique to reduce overplotting in scatter plots. The example used data that are rounded to the nearest unit, although the idea applies equally well to ordinal data in general.
The act of jittering (adding random noise to data) is a statistical irony: statisticians spend most of their day trying to "remove" noise from data, but jittering puts noise back in!
Personally, I rarely jitter data. I prefer to visualize the data as they are, but I acknowledge that there are situations in which jittering gives a better "feel" for the data. To help you decide whether to jitter, here are some pros and cons of jittering.
Arguments in Favor of Jittering
- Jittering reduces overplotting in ordinal data or data that are rounded.
- Jittering helps you to better visualize the density of the data and the relationship between variables.
- Jittering can help you to find clusters in the data. (Use a small scale parameter for this case.)
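To make the first point concrete, here is a minimal Python sketch of uniform jittering applied to data that are rounded to the nearest unit. The function name jitter, its width parameter, and the simulated measurements are all hypothetical choices made for illustration; they are not taken from the earlier article.

```python
import random

def jitter(values, width, rng):
    """Add uniform noise on [-width/2, width/2] to each value.

    `width` is assumed to be the rounding unit (1 for data rounded to
    the nearest unit). Both the name and the signature are illustrative.
    """
    half = width / 2.0
    return [v + rng.uniform(-half, half) for v in values]

rng = random.Random(1)  # fixed seed so that the jittered values are reproducible

# Hypothetical measurements rounded to the nearest unit: many exact ties.
x = [rng.choice([4, 5, 6, 7]) for _ in range(200)]
y = [rng.choice([2, 3, 4]) for _ in range(200)]

xj = jitter(x, width=1, rng=rng)
yj = jitter(y, width=1, rng=rng)

# Count the distinct plotting positions before and after jittering.
distinct_before = len(set(zip(x, y)))    # at most 4 * 3 = 12 positions
distinct_after = len(set(zip(xj, yj)))   # ties are broken by the noise
print(distinct_before, distinct_after)
```

With 200 observations piled onto at most 12 positions, a scatter plot of the raw values cannot show how many observations share each position. The jittered coordinates spread each pile over a unit square, which is what reveals the density. A smaller width keeps the piles tighter, which is the setting to try when you are looking for clusters.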
Arguments Against Jittering
- Jittering adds random components to variables, which means that there is not a unique way to jitter.
- Choosing the size of the random component is not easy to automate; it requires domain-specific knowledge. For example, are data recorded to the nearest unit or the nearest half-unit?
- The distribution of the random component is not always clear. In the iris data, I jittered by using random values drawn from a uniform distribution. But suppose a variable records the Richter magnitude of earthquakes (rounded to the nearest 0.1). Should you use a uniform distribution to jitter these data? Probably not, because the Richter scale is a logarithmic scale, and because earthquakes with lower magnitudes occur more frequently than earthquakes with higher magnitudes.
- If the X and Y variables are related (for example, highly correlated), jittering each variable independently might result in a graph in which the relationship is less visually apparent. Think about the extreme case where X and Y are exactly linearly related: adding independent noise to each variable results in a graph in which the linear relationship appears weaker.
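The first and last of these arguments can be checked numerically. The following Python sketch is an illustration under assumed data: X takes the values 0 through 4, Y is exactly 2X + 1, and both are jittered independently with uniform noise one unit wide. The helper names pearson and jitter are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def jitter(values, width, rng):
    """Add uniform noise on [-width/2, width/2] to each value (illustrative helper)."""
    half = width / 2.0
    return [v + rng.uniform(-half, half) for v in values]

# Hypothetical data: X is rounded to the nearest unit and Y is exactly linear in X.
x = [i % 5 for i in range(100)]
y = [2 * v + 1 for v in x]
r_raw = pearson(x, y)            # equals 1 up to floating-point rounding

# Jitter each variable independently, as you would for a scatter plot.
rng = random.Random(2)
xj = jitter(x, width=1, rng=rng)
yj = jitter(y, width=1, rng=rng)
r_jit = pearson(xj, yj)          # strictly weaker than the exact relationship

# A different seed gives a different jittered data set: the result is not unique.
xj_other = jitter(x, width=1, rng=random.Random(3))

print(r_raw, r_jit, xj != xj_other)
```

In this setup the independent noise lowers the correlation only slightly, because the noise variance is small relative to the spread of X and Y. The loss grows as the jitter width increases relative to the range of the data, so the effect is most visible when the data span only a few rounding units.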
Can you think of any arguments that I did not mention? Do you think the arguments in favor of jittering outweigh the arguments against it? Are there other visualization techniques that you prefer instead of jittering? Weigh in on this issue by posting a comment.