Diagnosis: Your data is not “normal”

“Let’s assume a normal distribution …” Ugh! That was your first mistake. Why do we make this assumption? It can’t be because we want to be able to mentally compute standard deviations, because we can’t, and we don’t do it that way in practice. No, we assume a normal distribution to simplify our decision-making process – with it we can pretend to ignore the outliers and extremes, we can pretend that nothing significant happens very far from the mean.

Big mistake.

There are well over a hundred statistical distributions other than the “normal” available to characterize your data. Let’s look at a few of the major categories that describe much of the physical, biological, economic, social and psychological data we encounter in our business decision-making and management processes.

The big one when it comes to business impact is what is commonly known as the “fat tail” (or sometimes, “long tail”). These are Nassim Taleb’s “Black Swans”. In the real world, unlikely events don’t necessarily tail off quickly to a near-zero probability, but remain significant even in the extreme, and as Taleb points out, become not just likely over the longer term, but practically inevitable. It is these fat-tail events that leave us scratching our heads when our 95%-confident plans go awry.
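
To see how wide the gap can be, here is a minimal sketch in Python using scipy.stats, with a Student’s t distribution (3 degrees of freedom) standing in as an illustrative fat-tailed curve – an assumption for demonstration, not Taleb’s specific model:

```python
import math
from scipy import stats

# How likely is a 5-sigma event under each curve?
p_normal = stats.norm.sf(5)  # survival function: P(X > 5) for a standard normal

# Student's t with 3 degrees of freedom, rescaled to unit variance so the
# comparison is apples-to-apples (its raw variance is df / (df - 2) = 3)
df = 3
t_scale = 1 / math.sqrt(df / (df - 2))
p_fat = stats.t.sf(5, df=df, scale=t_scale)

print(f"Normal tail:     {p_normal:.1e}")  # ~2.9e-07: 'never happens'
print(f"Fat tail (t, 3): {p_fat:.1e}")     # ~1.7e-03: thousands of times likelier
```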

Next up are the bounded, or skewed, distributions. Some things are more likely to happen in one direction than in the other. Unlike in a normal distribution, the mode, median and mean of a skewed distribution are three different values. Zero is a common left-hand bound, where a variable cannot take on negative values. Many production and quality issues have this bounded characteristic: oversize is less common than undersize, because you can always remove material but you can’t put it back on (additive manufacturing excepted). Too large a part sometimes simply won’t fit into the tool or jig, but you can grind a piece down to nothing if you’re not paying attention (I have a story about that best saved for another post).
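
As a quick illustration, here is a sketch using a lognormal – a distribution bounded at zero and skewed to the right (the lognormal and its parameters are chosen purely for demonstration) – showing those three centers pulling apart:

```python
import numpy as np
from scipy import stats

# A lognormal is bounded at zero and right-skewed, like many size or
# duration measurements; sigma and mu here are illustrative values
sigma, mu = 0.9, 0.0
dist = stats.lognorm(s=sigma, scale=np.exp(mu))

mode = np.exp(mu - sigma**2)   # closed-form mode of a lognormal
median = dist.median()         # exp(mu)
mean = dist.mean()             # exp(mu + sigma^2 / 2)

print(f"mode={mode:.2f}  median={median:.2f}  mean={mean:.2f}")
# mode=0.44  median=1.00  mean=1.50 -- three different 'centers'
```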

Discrete or step-wise functions might describe a number of our business processes. We make a lot of yes/no, binary, all-or-nothing decisions in business, where the outcome becomes either A or B but not much in between. In these cases, it becomes important to have a good handle on the limited range over which an assumption of normality actually holds.
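
Here is a sketch of what checking that range might look like: comparing an exact binomial (yes/no) probability against its normal approximation, with made-up sample sizes and rates:

```python
import math
from scipy import stats

def exact_vs_normal(n, p, k):
    """Compare the exact binomial P(X <= k) with its normal approximation."""
    exact = stats.binom.cdf(k, n, p)
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    approx = stats.norm.cdf(k + 0.5, mu, sd)  # with continuity correction
    return exact, approx

# Reasonable in the middle of a big, balanced sample...
print(exact_vs_normal(n=1000, p=0.5, k=480))   # ~(0.109, 0.109)

# ...much shakier for a small sample with an extreme rate
print(exact_vs_normal(n=20, p=0.02, k=0))      # ~(0.668, 0.563)
```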

Poisson distributions describe counts of events over a fixed time interval, such as the frequency of customers walking in the door, calls coming into the call center, or trucks arriving at the loading dock. Understanding this behavior is critical to efficient resource allocation; otherwise you may either overstaff, influenced by the infrequent peaks, or understaff without the requisite flexibility to bring additional resources to bear when needed.
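
For instance, a hypothetical staffing calculation with scipy.stats – the arrival rate and the one-call-per-agent capacity are assumptions purely for illustration:

```python
from scipy import stats

# Suppose the call center averages 12 calls per 15-minute interval
rate = 12

# If each agent can handle one call per interval, how many agents
# does it take to cover demand in at least 95% of intervals?
for agents in range(rate, 3 * rate):
    coverage = stats.poisson.cdf(agents, rate)
    if coverage >= 0.95:
        print(f"{agents} agents cover {coverage:.1%} of intervals")
        break

# Staffing to the mean leaves us short far more often than intuition suggests
print(f"P(more than {rate} calls) = {stats.poisson.sf(rate, rate):.1%}")  # ~42%
```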

Power laws. Would you think that the population of stars in the galaxy follows a normal distribution, with a sort of average-sized star being the most common? Not even close. Small brown and white dwarfs are thousands of times more common than Sun-sized stars, which are in turn tens of thousands of times more common than blue and red giants like Rigel and Betelgeuse. Thank goodness things like earthquakes and tornadoes follow this pattern, known as a “power law”.

Much of the natural world is governed by power laws, which look nothing at all like a normal distribution. Smaller events are orders of magnitude more likely to occur than medium-sized events, which in turn are orders of magnitude more likely than large ones. Plotted on linear axes, a power law shoots up in hockey-stick fashion, but it is typically displayed on a logarithmic scale, which converts the hockey stick into a straight line. Don’t let the linearity fool you, though – that vertical scale is growing by a factor of ten with each tick mark.
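
Here is a small simulation of that orders-of-magnitude behavior, drawing event sizes from a Pareto distribution (the exponent alpha = 2 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw one million event sizes from a classical Pareto (power-law) distribution;
# numpy's pareto() is shifted, so add 1 to put the minimum size at 1
alpha = 2.0
sizes = rng.pareto(alpha, 1_000_000) + 1

# Count events per order of magnitude
for lo, hi in [(1, 10), (10, 100), (100, 1000)]:
    n = np.count_nonzero((sizes >= lo) & (sizes < hi))
    print(f"size {lo:>4}-{hi:<4}: {n:>7} events")

# Each decade holds roughly 100x fewer events than the one before (for alpha = 2)
```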

Pull up a stock price chart: can you tell, without the axis labels, whether it shows monthly, hourly or per-minute price data? It could just as easily be your network traffic, again measured by the second or by the day. This type of pattern is known as fractal, with the key property of self-similarity: it looks the same no matter what scale it is observed at. Fractals conform to power laws, and therefore there are statistical approaches for dealing with them.
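
A toy demonstration of that self-similarity, using a Gaussian random walk as a stand-in for a price series (real markets are only approximately self-similar, so treat this strictly as a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# A plain random walk: the classic self-similar, price-like series
walk = np.cumsum(rng.standard_normal(1_000_000))

# Examine its moves over windows of very different lengths
for k in (1, 100, 10_000):
    moves = walk[k::k] - walk[:-k:k]   # non-overlapping k-step changes
    rescaled = moves.std() / np.sqrt(k)
    print(f"window={k:>6}  std of moves={moves.std():10.2f}  rescaled={rescaled:.3f}")

# Rescaled by sqrt(window), every horizon looks statistically alike --
# which is why the chart looks the same by the minute or by the month
```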

One piece of good news: when it comes to forecasting, you don’t have to worry about normality – forecasting techniques do not depend on an assumption of normality. Knowing how to handle outliers, however, is crucial to forecast accuracy. In some cases they can be thrown out as true aberrations or bad data, but in other cases they really do represent the normal flow of business, and you ignore them at your peril. In forecasting, outliers often represent discrete events such as holidays or extreme weather, which can be isolated from the underlying pattern to improve the baseline forecast, then deliberately reintroduced when appropriate.
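
One common way to operationalize this – flagging outliers against a robust median/MAD scale, fitting the baseline without them, then re-applying the measured lift on known future event dates – is sketched below; the data and threshold are hypothetical:

```python
import numpy as np

def flag_outliers(series, threshold=3.5):
    """Flag points far from the median, using the robust MAD scale."""
    median = np.median(series)
    mad = np.median(np.abs(series - median))
    robust_z = 0.6745 * (series - median) / mad  # ~z-scores if data were normal
    return np.abs(robust_z) > threshold

# Daily demand with two event-driven spikes (say, holiday promotions)
demand = np.array([102, 98, 105, 99, 310, 101, 97, 103, 295, 100], dtype=float)

mask = flag_outliers(demand)
baseline = demand[~mask].mean()              # fit the baseline on clean history
event_lift = demand[mask].mean() - baseline  # how much the events added

print(f"baseline ~ {baseline:.0f}, event lift ~ {event_lift:.0f}")
# Forecast = baseline, with the lift re-applied on known future event dates
```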

What we’ve just discussed is called data characterization, and it is standard operating procedure for your data analysts and scientists. Analytics is a discipline. One of the first things your data experts will do is run statistics on the data to characterize it – to tell us something about its underlying properties and behavior – and to analyze the outliers, all part of the discipline and culture of analytics.
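
A first pass at characterization might look something like this sketch, with lognormal toy data standing in for a real business measure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.lognormal(mean=3.0, sigma=0.8, size=5_000)  # stand-in for real data

# First-pass characterization: center, spread, shape
print(f"mean={data.mean():.1f}  median={np.median(data):.1f}  std={data.std():.1f}")
print(f"skewness={stats.skew(data):.2f}  excess kurtosis={stats.kurtosis(data):.2f}")

# D'Agostino-Pearson test: does 'assume normal' survive contact with the data?
stat, p = stats.normaltest(data)
print(f"normality test p-value: {p:.2e}")  # a tiny p-value says: not normal
```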

Economists like to assume the “rational economic man” – it permits them to sound as if they know what they are talking about. Likewise, assuming a “rational consumer” (customer data is going to comprise a huge chunk of your Big Data) who behaves in a “normal” fashion is pushing things beyond the breaking point. While plenty of data sets are normal (there are no humans twice the average height, let alone ten times it), don’t assume normality in your data or your business processes where it isn’t warranted.

Soon enough we’ll probably drop the “big” from Big Data and just get on with it, but still, your future is going to have a LOT of data in it, and properly characterizing that data using descriptive analytics in order to effectively extract its latent value and insights will keep your Big Data exercise from turning into Big Trouble.
