Occasionally on a discussion forum, a statistical programmer will ask a question like the following:
"I am trying to fit a parametric distribution to my data. The sample has a long tail, so I have tried the lognormal, Weibull, and gamma distributions, but nothing seems to fit. Please help!!"
In general, there is no reason to expect a particular distribution to fit arbitrary data. However, sometimes the person asking the question has a theoretical reason why the model should fit. For example, the data might be "lifetime" data that are part of a reliability or survival analysis.
In several cases that I can remember, the problem occurred because the data distribution was skewed to the left, whereas by convention the usual "named" distributions have positive skewness. The following sample of 50 data values provides an example:
data Sample;
input X @@;
datalines;
81 91 90 91 87 56 93 80 80 89 93 87 86 58 81 90 82 71 85 94 86 79 82 89 87
87 96 76 77 91 87 67 93 84 90 88 78 92 87 86 61 82 83 92 81 83 87 91 84 72
;

proc univariate data=Sample;
   histogram X / endpoints=55 to 100 by 5
                 odstitle="Distribution of Sample Data";
run;
The descriptive statistics from PROC UNIVARIATE (not shown) indicate that the sample skewness is about -1.5. The histogram confirms that the data distribution has negative skewness. Consequently, the lognormal, Weibull, and gamma distributions will not fit these data well.
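If you only want to check the sign and size of the skewness before you choose a model, a shorter call is enough. The following is a minimal sketch that assumes the Sample data set defined earlier; the choice of PROC MEANS and of the statistics to display is mine, not part of the original analysis:

```sas
/* Sketch: print the sample skewness of X without the full
   PROC UNIVARIATE output. A negative value indicates a long left tail. */
proc means data=Sample n mean std skewness;
   var X;
run;
```

A clearly negative skewness is a hint that positively skewed models are unlikely to fit the data in their current form.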
A transformation that reverses the data distribution
You can transform the data so that the skewness is positive and the long tail is to the right. To do this correctly requires domain-specific knowledge, but the general idea is to apply a linear transformation of the form Y = c – b X for some constants c and b > 0. If you don't want to change the scale of the data, use b = 1.
For example, suppose that the data are the results of an assessment procedure that assigns a value between 0 (bad) and 100 (good) to each item on an assembly line. An alternative way to score each item is to record the number of points that are deducted by the assessment procedure. For the alternative scoring system, low scores are good and high scores are bad. The conversion between the scoring systems is simply Y = 100 – X. The following DATA step creates the new scores, and the call to PROC UNIVARIATE overlays several parametric models that are fit to the transformed data:
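To see why the reflection reverses the skewness, it helps to write out the computation. The sketch below uses the symbols \mu, \sigma, and \gamma for the mean, standard deviation, and skewness; that notation is my own and assumes b > 0:

```latex
\[
\mu_Y = c - b\,\mu_X, \qquad \sigma_Y = b\,\sigma_X,
\]
\[
\gamma(Y) = \frac{E\left[(Y-\mu_Y)^3\right]}{\sigma_Y^3}
          = \frac{(-b)^3\,E\left[(X-\mu_X)^3\right]}{b^3\,\sigma_X^3}
          = -\gamma(X).
\]
```

The constant c cancels out of the skewness, so it affects only the location of the transformed data. The scale factor b changes only the spread.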
data Transform;
   set Sample;
   Y = 100 - X;
run;

proc univariate data=Transform;
   var Y;
   histogram / endpoints=0 to 45 by 5
               odstitle="Distribution of Reversed Data"
               lognormal(threshold=0) Weibull(threshold=0) gamma(threshold=0);
run;
The transformed data have positive skewness. I used knowledge of the data measurements to choose reasonable values for the linear transformation that flips the data distribution. If you know nothing about the data, you could choose c to be any value greater than the maximum data value for X. That guarantees that the transformed values are all positive, so they could be modeled by a distribution that has zero for the threshold parameter. Try to choose a transformation for which the new measurements are easy to understand; different values of c will lead to different estimates for the parameters.
In summary, many standard modeling distributions (exponential, lognormal, gamma, Weibull, ...) assume that the data are positively skewed. If your data have negative skewness, try to use a linear transformation to reverse the data before you model them.
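When nothing is known about the measurements, the reflection point can be computed from the data. The following is a minimal sketch under stated assumptions: the macro variable maxX, the data set name Reflect, and the offset of 1 are illustrative choices of mine, and any value of c greater than the maximum of X would serve:

```sas
/* Sketch: choose c from the data when no natural value is known.
   The names maxX and Reflect and the offset of 1 are assumptions. */
proc sql noprint;
   select max(X) into :maxX trimmed from Sample;
quit;

data Reflect;
   set Sample;
   c = &maxX + 1;      /* any value greater than max(X) keeps Y > 0 */
   Y = c - X;          /* b = 1 keeps the original scale */
run;

proc univariate data=Reflect;
   var Y;
   histogram / lognormal(threshold=0) Weibull(threshold=0) gamma(threshold=0);
run;
```

Because the parameter estimates depend on c, report the value that you used together with the fitted model.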