The post Distributions with specified skewness and kurtosis appeared first on The DO Loop.

The moment-ratio diagram is a statistical tool that helps to understand how to answer this question. This article shows how to use the moment-ratio diagram to find
parameter values for the Beta(a,b) distribution that achieve a wide range of skewness-kurtosis values. The same ideas apply to finding parameters for other distributions.
Recall that the *full kurtosis* is 3 more than the *excess kurtosis*. Both are popular measures, but in this article, "kurtosis" refers to the *full* kurtosis.

The moment-ratio diagram shows the range of possible skewness and kurtosis values ((s,k) values, for short) for many common families of probability distributions. Some families (such as the normal distribution) can achieve only a single (s,k) value. Other families (such as the gamma distribution) can achieve (s,k) values along a curve. Still other families (such as the beta distribution) can achieve (s,k) values in a region.

In the diagram to the right, the gray region corresponds to the possible (s,k) values for the Beta distribution.
You can look up the formulas for the skewness and kurtosis of the Beta(a,b) distribution. The formulas for a > 0 and b > 0 are

**Skewness:** s(a,b) = ( 2 (b-a) sqrt(a+b+1) ) / ( (a+b+2) sqrt(a b) )

**Full Kurtosis:** k(a,b) = 3 + 6 ( (a-b)^{2} (a+b+1) - a b (a+b+2) ) /
( a b (a+b+2) (a+b+3) )

In other words, the gray region in the moment-ratio diagram is the image of the parameter region {(a,b) | a > 0, b > 0} under the nonlinear transformation (a,b) → (s,k).
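As a quick sanity check on these formulas, here is a Python sketch (an illustration, not part of the article's SAS program). As noted below, symmetric distributions (s=0) occur when a=b; for that case the kurtosis formula reduces to 3 − 6/(2a+3):

```python
import math

def skew_beta(a, b):
    """Skewness of the Beta(a,b) distribution."""
    return 2*(b - a)*math.sqrt(a + b + 1) / ((a + b + 2)*math.sqrt(a*b))

def kurt_beta(a, b):
    """Full (non-excess) kurtosis of the Beta(a,b) distribution."""
    return 3 + 6*((a - b)**2*(a + b + 1) - a*b*(a + b + 2)) \
                 / (a*b*(a + b + 2)*(a + b + 3))

print(skew_beta(2, 2))   # 0.0 : symmetric when a = b
print(kurt_beta(2, 2))   # about 2.14, which is 3 - 6/7
print(skew_beta(1, 3))   # positive skewness because b > a
```

Notice also that swapping a and b negates the skewness, which is the symmetry that lets you consider only nonnegative skewness values.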

The equations that define the Beta region in the moment-ratio diagram are well known.
The lower bound of the Beta region is the boundary of the "impossible region," so

k > 1 + s^{2}

as it is for all distributions. The upper bound of the Beta region is the curve defined by the gamma distribution family, so

k < 3 + 1.5 s^{2}.

Consequently, a good choice for a "representative" curve of (s,k) values for the beta distribution is the average of the two boundary curves, which is

k = 2 + 1.25 s^{2}.

It suffices to consider only nonnegative skewness because if X is a Beta-distributed random variable that has skewness s, then 1-X is a beta-distributed variable that has skewness -s. Symmetric distributions (s=0) occur when a=b.

The following set of evenly spaced skewness values is a good choice for simulating random samples from the Beta distribution that have a wide range of skewness (s) and kurtosis (k) values:

/* specify values in the moment-ratio diagram for which the Beta
   distribution has a variety of (s,k) values */
data SKBetaParms;
do s = 0 to 2.4 by 0.1;
   s = round(s, 0.1);
   k = 2 + 1.25 * s**2;   /* middle of the Beta region */
   output;
end;
run;
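For readers without SAS, the same grid of (s,k) targets takes a few lines of Python. The sketch below also checks that the representative curve lies strictly between the impossible-region boundary and the gamma boundary:

```python
# the same (s,k) grid in Python: s = 0, 0.1, ..., 2.4 along k = 2 + 1.25*s^2
grid = []
for i in range(25):
    s = round(i*0.1, 1)
    k = 2 + 1.25*s**2          # middle of the Beta region
    grid.append((s, k))

# every point lies strictly between the impossible-region boundary (k = 1 + s^2)
# and the gamma-family boundary (k = 3 + 1.5*s^2)
for s, k in grid:
    assert 1 + s**2 < k < 3 + 1.5*s**2
print(grid[0])                 # (0.0, 2.0): the symmetric case
```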

For each (s,k) value in the moment-ratio diagram, you can find the corresponding (a,b) parameter value for the Beta distribution that has the specified skewness and kurtosis. You can do this by solving for the root of the vector-valued function M(a,b) - (s,k), where M(a,b) is the transformation that maps parameter values to the corresponding (s,k) values. This can be done by using the SAS IML language, which enables you to define the vector-valued mapping and solve nonlinear equations in a least-squares sense.

The steps in the process are as follows:

- Define a function (SKBetaFun) that takes an (a,b) input value and returns an (s,k) value as output.
- Define a function (VecFun) that evaluates the vector-valued function M(a,b) - (s,k).
- Define a function (SolveForBetaParam) that takes an (s,k) value and calls the NLPHQN subroutine in SAS IML to obtain an (a,b) value such that the norm of the VecFun function is minimized. Because SAS IML functions can take vectors of arguments, the SolveForBetaParam function can take several (s,k) values and return several (a,b) values.
- For each (s,k) value in the SKBetaParms data set, call the SolveForBetaParam function to get the corresponding (a,b) parameters.

proc iml;
/* Helper: return the skewness of the Beta(a,b) distribution */
start SkewBeta(a,b);
   return ( 2*(b-a)#sqrt(a+b+1) ) / ( (a+b+2)#sqrt(a#b) );
finish;
/* Helper: return the full kurtosis of the Beta(a,b) distribution */
start KurtBeta(a,b);
   return 3 + 6* ( (a-b)##2 # (a+b+1) - a#b#(a+b+2) ) /
              ( a#b#(a+b+2)#(a+b+3) );
finish;
/* 1. Define a function that takes an (a,b) value and returns an (s,k) value */
start SKBetaFun(a,b);
   return ( SkewBeta(a,b) || KurtBeta(a,b) );   /* return a ROW vector */
finish;
/* 2. Define a function that evaluates the vector-valued function M(a,b) - (s,k) */
start VecFun(param) global(g_skewTarget, g_kurtTarget);
   a = param[1];  b = param[2];
   target = g_skewTarget || g_kurtTarget;
   return( SKBetaFun(a,b) - target );
finish;
/* 3. Define a function that takes a vector of (s,k) values and calls the
      NLPHQN subroutine in SAS IML to obtain (a,b) values that minimize
      the norm of the VecFun function */
start SolveForBetaParam(skew, kurt, printLevel=0) global(g_skewTarget, g_kurtTarget);
   /*     a    b    constraints. Lower bounds in 1st row; upper bounds in 2nd row */
   con = {1e-6 1e-6,   /* a > 0 and b > 0 */
            .    .  };
   x0 = {1 1};         /* initial guess */
   optn = 2 //         /* solve least square problem that has 2 components */
          printLevel;  /* amount of printing */
   ab = j(nrow(skew), 2, .);   /* return the a and b vectors as columns in matrix */
   do i = 1 to nrow(skew);
      g_skewTarget = skew[i];
      g_kurtTarget = kurt[i];
      call nlphqn(rc, Soln, "VecFun", x0, optn) blc=con;  /* solve for LS soln */
      if rc > 0 then ab[i,] = Soln;
   end;
   return( ab );
finish;
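If you don't have access to SAS IML, here is a rough Python analog (an illustrative sketch, not the author's program; the names `sk_beta` and `solve_beta_param` are my own). Because there are two equations in two unknowns, a plain Newton iteration with a numeric Jacobian can stand in for the least-squares solver:

```python
import math

def sk_beta(a, b):
    """Map (a,b) to the (skewness, full kurtosis) of the Beta(a,b) distribution."""
    s = 2*(b - a)*math.sqrt(a + b + 1) / ((a + b + 2)*math.sqrt(a*b))
    k = 3 + 6*((a - b)**2*(a + b + 1) - a*b*(a + b + 2)) \
              / (a*b*(a + b + 2)*(a + b + 3))
    return s, k

def solve_beta_param(s_target, k_target, a=1.0, b=1.0, tol=1e-10, h=1e-7):
    """Newton iteration on (s(a,b), k(a,b)) - (s_target, k_target) = 0,
       using a forward-difference Jacobian."""
    for _ in range(100):
        s, k = sk_beta(a, b)
        f0, f1 = s - s_target, k - k_target
        if max(abs(f0), abs(f1)) < tol:
            break
        sa, ka = sk_beta(a + h, b)          # perturb a
        sb, kb = sk_beta(a, b + h)          # perturb b
        J00, J01 = (sa - s)/h, (sb - s)/h   # numeric Jacobian
        J10, J11 = (ka - k)/h, (kb - k)/h
        det = J00*J11 - J01*J10
        a += (-f0*J11 + f1*J01)/det         # Cramer's rule for the Newton step
        b += (-f1*J00 + f0*J10)/det
        a, b = max(a, 1e-6), max(b, 1e-6)   # respect a > 0, b > 0
    return a, b

a, b = solve_beta_param(0.0, 2.0)   # symmetric case: s=0, full kurtosis 2
print(round(a, 4), round(b, 4))     # 1.5 1.5
```

The returned (1.5, 1.5) matches the symmetric Beta distribution discussed below.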

You can read in the skewness-kurtosis pairs for the moments of the Beta distribution, then call the SolveForBetaParam function to obtain the corresponding (a,b) pairs:

/* 4. use values of (skew, kurt) that are in the middle of the Beta region.
      Solve for (a,b) parameters. */
use SKBetaParms;  read all var {'s' 'k'};  close;
Soln = SolveForBetaParam(s, k);
a = Soln[,1];  b = Soln[,2];
print s k a[F=5.2] b[F=5.2];

A few of the (a,b) parameter values are shown. For example, the parameters a = b = 1.5 define a symmetric Beta distribution that has full kurtosis equal to 2. The Beta(1.14, 1.86) distribution has skewness=0.4 and full kurtosis=2.2.

The (a,b) pairs are the preimage of points on the skewness-kurtosis curve in the center of the Beta region in the moment-ratio diagram. It isn't apparent from the table, but the preimage is a straight line in the (a,b) parameter space, as shown by the following graph. Notice, however, that the (a,b) values are not uniformly distributed along the line.

Now that we know the (a,b) values that correspond to the specified values of skewness and kurtosis, what do these distributions look like? One way to visualize the distributions is to plot the PDF for several (a,b) pairs. The following statements write the (a,b) pairs to a SAS data set, then create a graph that overlays only four density curves for which the distribution has skewness values in the set {0, 0.5, 1, 2}. Each PDF is labeled by its parameters:

/* what do the curves look like? */
create Params var {'s' 'k' 'a' 'b'};  append;  close;
QUIT;

data PDF;
set Params(WHERE=(s in (0, 0.5, 1, 2)));   /* display these four curves */
/* https://blogs.sas.com/content/iml/2018/08/08/plot-curves-two-categorical-variables-sas.html */
Group = catt("(a,b) = (", putn(a,5.2)) || "," || catt(putn(b,5.2)) || ")";
do x = 0.001, 0.005, 0.01 to 0.99 by 0.01, 0.999;
   PDF = pdf("Beta", x, a, b);
   output;
end;
run;

title "Four Beta Distributions with Specified Skewness and Kurtosis";
title2 "Skewness = {0, 0.5, 1, 2}";
proc sgplot data=PDF;
   series x=x y=PDF / group=Group;
   xaxis grid;
   yaxis grid min=0 max=5;
run;

The graph shows one symmetric Beta density and three curves that have various amounts of positive skewness.
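If you want to evaluate the density curves outside SAS, the PDF function can be mimicked with the gamma function. A short Python sketch (for illustration only; the Beta(1.5, 1.5) parameters come from the table above):

```python
import math

def beta_pdf(x, a, b):
    """PDF of the Beta(a,b) distribution at x in (0,1)."""
    B = math.gamma(a)*math.gamma(b)/math.gamma(a + b)   # beta function
    return x**(a - 1)*(1 - x)**(b - 1)/B

# Beta(1.5, 1.5): the symmetric curve with full kurtosis 2
print(beta_pdf(0.5, 1.5, 1.5))     # peak value 4/pi, about 1.273
total = sum(beta_pdf(i/1000, 1.5, 1.5) for i in range(1, 1000))/1000
print(round(total, 2))             # about 1.0: a crude Riemann sum of the density
```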

The moment-ratio diagram shows the possible skewness and kurtosis values for probability distributions. For example, the beta region is a U-shaped region. For each (s,k) pair in the beta region, there is a corresponding pair of parameter values, (a,b), such that Beta(a,b) has skewness=s and kurtosis=k. You can use nonlinear least squares optimization to find the (a,b) parameters for each feasible (s,k) value. This article uses SAS to find (a,b) parameter values for a range of (s,k) values. After you have obtained the (a,b) values, you can use them to simulate random Beta samples as part of a simulation study.


The post Improve the Federal Reserve's dot plot appeared first on The DO Loop.

A *dot plot* is a standard statistical graphic that displays a statistic (often a mean) and the uncertainty of the statistic for one or more groups. Statisticians and data scientists use it in the analysis of group data.
In late 2023, I started noticing headlines about "dot plots" in the national news media such as Bloomberg's article, "What is the Fed's dot plot and why does it matter?" (A. Bull, Mar 19, 2024),
and the *New York Times* article, "How to Read the Fed's 'Dot Plot' Like a Pro"
(J. Smialek, Dec 13, 2023).
Can it be that the statistical dot plot has gone mainstream?

The "Fed" is the Federal Reserve Bank of the United States, which is the central bank responsible for setting monetary policy, including the *federal funds rate*, which is the interest rate that commercial banks charge each other for overnight lending. That rate affects other interest rates, such as corporate loans, credit cards, and mortgages.
The *New York Times* article states, "When the central bank releases its Summary of Economic Projections each quarter, Fed watchers focus obsessively on one part in particular: the so-called dot plot." An example of the Fed dot plot from the
Fed's report on March 20, 2024, is shown to the right.

I usually call this kind of plot a *strip plot*, but to avoid confusion I will call it a dot plot in this article.
This article discusses how to interpret a statistical dot plot and how to create it in SAS.
I also suggest an improvement to the Fed's dot plot.

Economists and corporations look at the Fed's dot plot to try to predict whether interest rates will be rising or falling. There are 19 members of the Federal Open Market Committee (FOMC). Each member can contribute a dot anonymously, so the dot plot has up to 19 points for each year and for the "longer run." In March 2024, one member did not supply a forecast for the "longer run," so there are only 18 dots for that category.

How do you interpret a dot plot that has a discrete X axis and a continuous Y axis? The first feature to notice is the vertical spread within each category. Within each year, are the forecasts close together or are they spread far apart? A small spread indicates consensus among the FOMC members; a large spread indicates uncertainty and disagreement of opinion. Readers are more confident in the forecasts when they see a small spread. For the March 2024 dot plot, there is general agreement for 2024, but less agreement for 2025 and 2026.

In the Fed's dot plot, the X axis represents time. Consequently, the second feature to notice is the trend of the forecasts. It is easier to visualize the trend if you estimate the median forecast for each time period. For the March 2024 dot plot, the trend indicates a consensus that interest rates will decrease in 2025 and 2026.

The dot plot from the Fed rounds the members' forecasts to the nearest eighth of a percentage point. This means that we don't know the true forecasts. For example, a member who forecasts a rate of 4.1 is represented by a dot at 4.125 (=4 1/8). Surprisingly, most forecasts from March 2024 are rounded to 1/8, 3/8, 5/8, or 7/8 ("eighths") rather than "fourths" such as 1/4, 1/2, and 3/4. To save typing, the following SAS DATA step uses the ROUND function to round hypothetical forecasts to the values shown on the dot plot:

data FedRates;
length TimePeriod $11;
input TimePeriod 1-11 @;
do i = 1 to 19;
   input TargetRate @;
   TargetRate = round(TargetRate, 1/8);   /* round to nearest 1/8 */
   output;
end;
drop i;
/* TimePeriod Rate1-Rate19 */
datalines;
2024        4.4 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.9 4.9 4.9 4.9 4.9 5.1 5.1 5.4 5.4
2025        2.6 3.1 3.1 3.3 3.6 3.6 3.6 3.6 3.6 3.9 3.9 3.9 3.9 3.9 3.9 4.1 4.4 4.4 5.4
2026        2.4 2.4 2.5 2.6 2.9 2.9 2.9 2.9 2.9 3.1 3.1 3.1 3.1 3.1 3.1 3.4 3.4 3.6 4.9
Longer Run  2.4 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.6 2.75 3.0 3.0 3.0 3.1 3.5 3.5 3.75 .
;
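The round-to-nearest-1/8 rule is easy to verify outside SAS. Here is a Python sketch that uses the 2024 row from the DATA step above (an illustration, not part of the original program); it also computes the median forecast, a statistic that plays a role later in this article:

```python
from statistics import median

# the 19 hypothetical 2024 forecasts from the DATA step above
raw_2024 = [4.4, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6,
            4.9, 4.9, 4.9, 4.9, 4.9, 5.1, 5.1, 5.4, 5.4]
rounded = [round(x*8)/8 for x in raw_2024]   # round to the nearest 1/8
print(rounded[0])        # 4.375 : a forecast of 4.4 plots as 4 3/8
print(median(rounded))   # 4.625 : the median 2024 forecast (4 5/8)
```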

Let's use PROC SGPLOT in SAS to visualize the data by using default attributes for the markers and axes. You can use the JITTER option on the SCATTER statement to avoid overplotting repeated values. This ensures that all members' forecasts are visible.

title 'Federal Open Market Committee''s "Dot Plot"';
title2 "First View";
proc sgplot data=FedRates noautolegend;
   scatter x=TimePeriod y=TargetRate / jitter;
   yaxis values=(0 to 7 by 0.5) grid;
run;

The graph shows the default visualization of the Fed's forecasts. The next sections modify the plot to mimic the Fed's layout and attributes.

The default scatter plot is sufficient for visualizing the Fed's forecasts, but let's see how closely we can mimic the Fed's version of this graph. Here are a few things that I noticed in the Fed's dot plot:

- The plot is taller than it is wide. You can change the shape of a SAS graph by using the ODS GRAPHICS statement.
- Most horizontal grid lines are dotted. Furthermore, there are minor ticks on the axes that are unlabeled. You can control grid lines by using the YAXIS statement. You can add one minor tick mark between major ticks by using the MINOR MINORCOUNT=1 options.
- The reference lines at integer values are solid. You can use the REFLINE statement to add these horizontal reference lines.
- There is a vertical reference line that divides the "longer run" forecasts from the near-term forecasts. You can use the REFLINE statement to add this vertical reference line. You need to use the DISCRETEOFFSET option to specify that the line should be offset from its associated tick mark.
- The vertical axis on the left is not labeled. Instead, there is an axis on the right. You can use the Y2AXIS option to mimic the Fed's design, but I prefer to display axes on both the left and the right sides. You can display a second axis by adding a second SCATTER statement and, to prevent overplotting, you can use the SIZE=0 option to make the second plot invisible.
- The Fed's dot plot uses filled markers. You can use the MARKERATTRS= option to set the attributes of markers.

The following SAS statements use PROC SGPLOT to emulate the Fed's design for the dot plot:

ods graphics / width=480px height=600px;
title 'Federal Open Market Committee''s "Dot Plot"';
%let axisOpts = values=(0 to 7 by 0.5) minor minorcount=1
                grid minorgrid gridattrs=(pattern=dot);
proc sgplot data=FedRates noautolegend;
   refline 0 to 7 / axis=y;                      /* reflines at integers */
   refline 'Longer Run' / axis=x discreteoffset=-0.5
           lineattrs=(pattern=dot);              /* divider before "longer run" */
   scatter x=TimePeriod y=TargetRate / jitter
           markerattrs=(symbol=circlefilled);    /* main plot */
   scatter x=TimePeriod y=TargetRate /
           markerattrs=(size=0) y2axis;          /* invisible: adds Y2 axis */
   yaxis &axisOpts;                              /* Y axis options */
   y2axis &axisOpts display=(nolabel);           /* Y2 axis options */
run;

As Jeanna Smialek wrote in her *NYT* article, economists "fixate most intently on the middle dot.... That middle, or median, ... is the clearest estimate of where the central bank sees policy heading."
The median value is not shown explicitly on the dot plot, but it is shown in a table on a separate page.
Let's add median information to the dot plot.

Some of the other graphs in the same report display medians. For example, the following graph is from page 3 of the report for March 20, 2024. It shows the change in real GDP for five previous years and forecasts for the next three years and for the "longer run." Notice that this graph uses box plots to show the distribution of the FOMC forecasts, including the median. This shows that the Fed thinks that its target audience is sophisticated enough to understand and interpret a box plot.

One possible modification is to add a box plot to the Fed's dot plot. If you use the VBOX statement, you can also use the DISPLAYSTATS=(MEDIAN) option to automatically add a table to the plot that displays the median value for each year. Furthermore, you can use the CONNECT=MEDIAN option to overlay a line that connects the median values for each category, as follows:

title 'Federal Open Market Committee''s "Dot Plot"';
title2 'Overlay a Box Plot';
%let axisOpts = values=(0 to 7 by 0.5) minor minorcount=1
                grid minorgrid gridattrs=(pattern=dot);
proc sgplot data=FedRates noautolegend;
   refline 0 to 7 / axis=y;
   refline 'Longer Run' / axis=x discreteoffset=-0.5 lineattrs=(pattern=dot);
   vbox TargetRate / category=TimePeriod boxwidth=0.8
        fillattrs=(transparency=0.5) medianattrs=(thickness=2)
        displaystats=(Median) nomean nooutliers nocaps
        connect=median;                          /* box plot */
   scatter x=TimePeriod y=TargetRate / jitter
           markerattrs=(symbol=circle);          /* main plot */
   scatter x=TimePeriod y=TargetRate /
           markerattrs=(size=0) y2axis;          /* invisible: add Y2 axis */
   yaxis &axisOpts;
   y2axis &axisOpts display=(nolabel);
run;

I like the table at the bottom of the plot that displays the median forecast. However, I don't think that the trend line is strictly necessary. In general, I think this plot is a little too "busy" for a public audience. The box plots obscure the individual forecasts. Furthermore, the sample medians often equal the first or third quartiles (because of tied values), which correspond to the lower or upper edges of the boxes.

It might be better to omit the boxes but keep the median indicator. There are a few ways to do that. The simplest is to use PROC MEANS to compute the median for each category, write the medians to a SAS data set, and merge them with the original data (the merged data set is named All below). The following statements then use the HIGHLOW statement to overlay the medians on the dot plot. To make the medians more visible, you can display the forecasts as open circles instead of filled circles, as follows:

title 'Federal Open Market Committee''s "Dot Plot"';
title2 'Overlay a Median';
%let axisOpts = values=(0 to 7 by 0.5) minor minorcount=1
                grid minorgrid gridattrs=(pattern=dot);
proc sgplot data=All noautolegend;
   refline 0 to 7 / axis=y;
   refline 'Longer Run' / axis=x discreteoffset=-0.5 lineattrs=(pattern=dot);
   /* https://blogs.sas.com/content/graphicallyspeaking/2017/06/16/scatter-mean-value/ */
   highlow x=TimePeriod low=Median high=Median / nofill type=bar barwidth=0.8
           lineattrs=GraphData1(thickness=2 color=CxEC1D26);  /* display medians */
   scatter x=TimePeriod y=TargetRate / jitter
           markerattrs=(symbol=circle);          /* main plot */
   scatter x=TimePeriod y=TargetRate /
           markerattrs=(size=0) y2axis;          /* invisible: add Y2 axis */
   /* make an invisible boxplot, but add the median as an XAXISTABLE */
   vbox TargetRate / category=TimePeriod displaystats=(Median) transparency=1;
   yaxis &axisOpts label="Target Rate (%)";
   y2axis &axisOpts display=(nolabel);
run;

This is my favorite version of the Fed's dot plot. It enables the reader to quickly understand the median values, observe trends, and see the distribution of the FOMC forecasts.

The US Federal Reserve regularly releases a dot plot that communicates information about the possible future direction of interest rates. I have shown how to interpret the Fed's dot plot and how to create a similar plot by using SAS. I have also suggested a simple but effective modification to the Fed's dot plot: Add a graphical indicator of the median and a tabular display of the median for each time period.


The post Add a second axis to a SAS graph appeared first on The DO Loop.

But what if you want an axis on both the left *and* the right? (Or both the top and bottom!)
This can be useful when your graph is very wide or tall.

In SAS, one way to create two axes is to duplicate the primary plot, but make the second plot *invisible*!
Using an invisible plot to generate a second axis is an extremely useful SAS tip for creating statistical graphics. This trick was used countless times by Warren Kuhfeld to create innovative graphs. It was Warren who emphasized to me how valuable it can be to create an invisible plot.

When you create the invisible plot, you should use the Y2AXIS or X2AXIS option to specify which axis you are adding. There are a few ways to create an invisible plot:

- For plots that display markers (such as a scatter plot), you can set the size of the markers to zero by using the option MARKERATTRS=(SIZE=0). Similarly, you can make lines invisible by using the LINEATTRS=(THICKNESS=0) option.
- Many plots that display areas (such as a bar chart), have options such as NOFILL and NOOUTLINE that you can use to make cells invisible.
- Almost every plot supports the TRANSPARENCY=1 option, which makes the plot completely transparent.

This article demonstrates how to use the invisible plot trick to create a second axis on the right for a scatter plot. A second example shows how to display duplicate axes for a bar chart.

By default, the vertical axis is on the left (the Y axis) and the horizontal axis is at the bottom (the X axis). You can use the Y2AXIS and X2AXIS options to change the location of an axis. For example, the following statements create a scatter plot that has the axis on the right side:

title "Axis on the Right";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City / y2axis;   /* move the axis to the right */
run;

To display two axes, create the graph as usual, which adds a vertical axis on the left. You can use the YAXIS statement to control the attributes of the vertical axis. If you want to also display a copy of the axis on the right, create an invisible plot. For a scatter plot, an easy way to get an invisible plot is to set MARKERATTRS=(SIZE=0), as below. However, you could also use the TRANSPARENCY=1 option.

title "Two Identical Axes";
title2 "An Invisible Plot";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City;            /* axis on the left */
   yaxis grid;
   scatter x=Weight y=MPG_City /
           markerattrs=(size=0) y2axis;    /* invisible plot; axis on the right */
   y2axis display=(nolabel);               /* optional: suppress label and/or values */
run;

If you create an invisible plot that is associated with both the X2 axis and the Y2 axis, you can display four axes. The following horizontal bar chart is tall, so it is useful to have axes at both the top and the bottom. Some of the bars are short and some are long, so it is helpful to display the categories on both the left and the right. So that the visible and invisible plots have the same attributes, I like to define a macro variable that specifies the shared options, as follows:

ods graphics / width=400px height=600px;
title "A Bar Chart with Two Sets of Axes";
%let opt = response=Weight categoryorder=respasc;
proc sgplot data=sashelp.class noautolegend;
   hbar name / &opt;                       /* axis on the bottom and left */
   xaxis grid label="Weight";
   yaxis grid display=(nolabel);
   /* invisible plot puts axis on the top and right */
   hbar name / &opt x2axis y2axis transparency=1;
   x2axis display=(nolabel);
   y2axis display=(nolabel);
run;

This article provides examples that show how to use the X2AXIS and Y2AXIS options to control the location of horizontal and vertical axes, respectively. To display an identical set of axes on both sides of a graph, create an invisible plot and associate it with the X2 or Y2 axis.


The post The likelihood ratio test for linear regression in SAS appeared first on The DO Loop.

The restriction of the log-likelihood function to a lower-dimensional subspace has a nice geometric interpretation. It is also the basis for the likelihood ratio (LR) test, which enables you to compare two related models and decide whether adding more parameters results in a better model.

This article uses SAS to illustrate two methods to compute the likelihood ratio test for nested linear regression models. The first method calls the GENMOD procedure twice and uses a DATA step to compute the LR test. The second method computes the LR test from "first principles" by using a SAS IML program.

Before we discuss submodels in an MLE context, let's briefly review submodels for ordinary least squares (OLS) regression. A previous article discusses the geometry and formulation of a restricted regression model by using the RESTRICT statement in PROC REG. Both the RESTRICT statement and the TEST statement in PROC REG enable you to compare a model that has many parameters with a submodel that has one or more restrictions on the parameters.

In PROC REG, you can use the TEST statement to test a null hypothesis that the submodel (often called a reduced model) is just as valid as the full model. PROC REG runs an F test to potentially reject the null hypothesis. I don't want to discuss the details of linear models, but essentially the test statistic is a ratio. The numerator involves the difference between the sum of squares for the reduced model and the sum of squares for the full model; the denominator involves the sum of squares for the full model. There are also some scaling factors; for details, see the Penn State "STAT 501" lecture notes.

The formula for the log-likelihood function of the simple regression model is discussed in
a previous article.
If you define \(r_i = y_i - x_i \beta\), then the log-likelihood function is

\(
\mathit{ll}(\beta, \sigma; y, x) = -\frac{1}{2} \left[ n \log( \sigma^2 ) + n \log (2 \pi ) + \sum_i \frac{r_i^2}{\sigma^2 } \right]
\)

For the simple linear regression model, the parameters are (β0, β1, σ).
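As a numerical check on the closed-form expression (a Python sketch, not part of the SAS program; the residuals are arbitrary illustrative numbers), summing the normal log-densities of the residuals gives the same value:

```python
import math

def loglik(resid, sigma):
    """Closed form: -(1/2)[n log(sigma^2) + n log(2 pi) + sum(r_i^2)/sigma^2]."""
    n = len(resid)
    return -0.5*(n*math.log(sigma**2) + n*math.log(2*math.pi)
                 + sum(r*r for r in resid)/sigma**2)

def normal_logpdf(r, sigma):
    """Log of the N(0, sigma) density at r."""
    return -math.log(sigma*math.sqrt(2*math.pi)) - r*r/(2*sigma**2)

resid = [0.3, -1.2, 0.8, -0.1]
diff = loglik(resid, 1.5) - sum(normal_logpdf(r, 1.5) for r in resid)
print(abs(diff) < 1e-12)   # True: the two forms agree
```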

When you use MLE to fit nested models, you can use geometry to visualize the restricted (or reduced) log-likelihood function. For example, the graph to the right shows a heat map for the full log-likelihood function for a data set (N=50) that is simulated from the model with (β0, β1, σ)=(10, 0.5, 1.5). The colors correspond to the height of the function for each pair of estimates (b0, b1) when the scale parameter (σ) has the value σ = 1.5.

The linear subspace where b1=0 is shown as a black horizontal line. If you slice the graph of the log-likelihood function along the line, you obtain the reduced log-likelihood function as a cross-section of the full function, as shown below:

In the full model (the heat map), the largest value of the function (for σ=1.5) is -97.11 at (b0, b1) = (9.7, 0.8). If you restrict b1 to the line where b1=0, then the restricted log-likelihood (LL) function has a smaller maximum value. For the restricted LL function, the maximum value is -97.66 at (b0, b1) = (10.1, 0).

In the previous section, the log-likelihood values for the full and reduced models are similar. Consequently, you might wonder whether there is any significant advantage to using the full model (three free parameters) over the reduced model (two free parameters because b1=0).

In PROC REG, you can use the TEST statement to test the null hypothesis that β1=0. If the test rejects the null hypothesis, then the full model is significantly better. If the test fails to reject, then you might as well use the simpler reduced model to describe the data.

In the MLE framework, you can use the likelihood ratio test to compare a complex (full) model to any nested submodel. You must compute two quantities:

- Compute LL_{F}, the maximum value of the full log-likelihood function. This is sometimes called the unrestricted maximum likelihood value.
- Compute LL_{R}, the maximum value of the restricted log-likelihood function. This is sometimes called the restricted maximum likelihood value.

The likelihood ratio test uses the statistic

LR = 2(LL_{F} – LL_{R})

If you assume that the MLE estimates are asymptotically normal (and a few other assumptions, as explained in the StatLect notes about the likelihood ratio test), then the test statistic follows a chi-squared distribution with *r* degrees of freedom, where *r* is the difference between the number of free parameters in the full and reduced models. (Note: I like to use the test statistic 2*abs(LL_{F} – LL_{R}), which doesn't require keeping track of which value is for the full model and which is for the reduced model.)
In the simple regression example, there are three free parameters in the full model and two free parameters in the reduced model, so *r*=1, and you should compare the test statistic to a critical value for the chi-squared distribution with 1 degree of freedom.
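Because the chi-squared distribution with 1 degree of freedom has the closed-form right-tail probability SF(x) = erfc(√(x/2)), the whole test fits in a few lines of Python. The LL values below are hypothetical numbers chosen only for illustration:

```python
import math

def lr_test_1df(ll_full, ll_reduced):
    """LR statistic 2|LL_F - LL_R| and its p-value from a chi-squared(1) distribution."""
    lr = 2*abs(ll_full - ll_reduced)
    pvalue = math.erfc(math.sqrt(lr/2))   # right-tail probability, Pr(X > lr)
    return lr, pvalue

lr, p = lr_test_1df(-96.507, -96.944)     # hypothetical full and reduced LL values
print(round(lr, 3))                       # 0.874
print(p > 0.05)                           # True: fail to reject the reduced model
```

The identity SF(x) = erfc(√(x/2)) follows because a chi-squared(1) variate is the square of a standard normal variate.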

The next sections show two ways to carry out the likelihood ratio (LR) test in SAS. The first method runs PROC GENMOD two times and uses the DATA step to compute the likelihood ratio test, as described in this SAS Note about the likelihood ratio test. The second method is a "manual" computation in SAS IML that shows exactly how the LR test works.

Notice that the LR test uses the difference of two log-likelihood values. The likelihood *ratio* test gets its name because the difference between two log-likelihoods is equal to the logarithm of the ratio of the likelihoods.

Let's analyze a full and reduced model for the simulated data (called Sim) from the previous article about MLE for linear regression models.

The following SAS program calls PROC GENMOD twice, first for the full model and then again for the reduced (nested) submodel.
For both calls, the ModelFit table is written to a data set.
The ModelFit table contains the log-likelihood (LL) value at the parameter estimates in the 6th row, and also contains the error degrees of freedom for the full and reduced models.
You can then use a DATA step to read the LL values and to form the test statistic 2*abs(LL_{F} – LL_{R}).
You can then compare the test statistic with the critical value of the chi-squared distribution with the appropriate degrees of freedom. Since the test will provide a one-sided p-value, you can use the SDF function instead of computing (1 – CDF) to obtain the right-tail probability.

/* Call PROC GENMOD twice and use DATA step to compute LR test.
   See https://support.sas.com/kb/24/474.html */
proc genmod data=Sim plots=none;
   Full: model y = x;
   ods select ModelFit;
   ods output ModelFit=LLFull;
run;
proc genmod data=Sim plots=none;
   Reduced: model y = ;
   ods select ModelFit;
   ods output ModelFit=LLRed;
run;

/* "Full Log Likelihood" is the 6th row in the data set */
data LRTest;
retain DF;
keep DF LLFull LLRed LRTest pValue;
merge LLFull(rename=(DF=DFFull Value=LLFull))
      LLRed (rename=(DF=DFRed  Value=LLRed));
if _N_ = 1 then DF = DFRed - DFFull;
if _N_ = 6 then do;
   LRTest = 2*abs(LLFull - LLRed);     /* test statistic */
   pValue = sdf("ChiSq", LRTest, DF);  /* p-value Pr(X > LRTest) where X~ChiSq(DF) */
   output;
end;
label LLFull='LL Full' LLRed='LL Red' LRTest='LR Test' pValue='Pr > ChiSq';
run;

proc print data=LRTest noobs label;
run;

The likelihood-ratio test statistic is 0.87436. For the chi-squared distribution with 1 degree of freedom, the probability is about 0.35 that a random variate exceeds this value. Because this probability is greater than 0.05, you fail to reject the null hypothesis that the reduced model (β1=0) explains the data as well as the full model. Therefore, you can use the reduced model for these data.

The Appendix shows the same computation performed "manually" in SAS IML by optimizing the log-likelihood function for the full and reduced models. The Appendix shows that you get the same values for the LR test.

This article shows how to use SAS to perform a likelihood ratio test for nested linear regression models. The example in the body of the article calls the GENMOD procedure twice and uses a DATA step to compute the LR test. The Appendix shows a second method, which is to compute the LR test from "first principles" by using a SAS IML program to optimize the full and reduced likelihood functions.

Although the example in this article is a linear regression model, the general idea applies to other nested models. For a description of the general case, see the StatLect lecture notes.

You can use PROC IML in SAS to compute the LR test from "first principles" by using a SAS IML program to optimize the full and reduced likelihood functions. The example uses the same simulated data (called Sim) as the rest of the article. The data are defined in the previous article about MLE for linear regression models.

The first step is to find the parameter values that maximize the log-likelihood (LL) function for the full model and to evaluate the LL function at the optimal value. This was done in the previous article, but it is repeated below:

proc iml;
/* the data are defined in https://blogs.sas.com/content/iml/2024/03/20/mle-linear-regression.html */
use Sim;  read all var {"x" "y"};  close;
X = j(nrow(x),1,1) || x;    * design matrix;

/* the three-parameter model (beta0, beta1, sigma) */
start LogLikRegFull(parm) global(X, y);
   b = parm[1:2];
   sigma = parm[3];
   r = y - X*b;
   LL = sum( logpdf("normal", r, 0, sigma) );
   return LL;
finish;

/* max of LogLikRegFull, starting from param0 */
start MaxLLFull(param0) global(X, y);
   /*     b0  b1  sigma constraint matrix */
   con = { .   .   1E-6,   /* lower bounds: none for beta[i]; 0 < sigma */
           .   .   .};     /* upper bounds: none */
   opt = {1,               /* find maximum of function */
          0};              /* do not print during optimization */
   call nlpnra(rc, z, "LogLikRegFull", param0, opt, con);
   return z;
finish;

param0 = {11 1 1.2};            /* initial guess for full model */
MLFull = MaxLLFull( param0 );
LLF = LogLikRegFull(MLFull);    /* LL at the optimal parameter */
print LLF[F=8.4], MLFull[c={'b0' 'b1' 'RMSE'} F=7.5];

The maximum LL value for the full model is LL_{F} = -96.507. We know that restricting the domain of the LL function will result in a smaller optimal value, but we have previously seen the graph of the restricted LL function, and we know that the maximum value of the restricted LL function is only slightly smaller. The following SAS IML statements find the optimal parameter values for the reduced model and evaluate the restricted LL function:

/* the two-parameter model (beta0, 0, sigma) */
start LogLikRegRed(parm) global(X, y);
   p = parm[1] || 0 || parm[2];
   return( LogLikRegFull( p ) );
finish;

/* max of LogLikRegRed, starting from param0 */
start MaxLLRed(param0) global(X, y);
   /*     b0  sigma constraint matrix */
   con = { .   1E-6,   /* lower bounds: none for beta0; 0 < sigma */
           .   .};     /* upper bounds: none */
   opt = {1,           /* find maximum of function */
          0};          /* do not print during optimization */
   call nlpnra(rc, z, "LogLikRegRed", param0, opt, con);
   return z;
finish;

param0 = {11 1.2};           /* initial guess */
MLRed = MaxLLRed( param0 );
LLR = LogLikRegRed(MLRed);   /* LL at the optimal parameter */
print LLR[F=8.4], MLRed[c={'b0' 'RMSE'} F=7.5];

The maximum LL for the reduced model is LL_{R} = -96.944. From these two values, you can construct the test statistic for the likelihood ratio test. You can then compare the test statistic with the critical value of the chi-squared distribution with 1 degree of freedom. Since the test will provide a one-sided p-value, you can use the SDF function instead of computing (1 – CDF) to obtain the right-tail probability:

/* LL ratio test for null hypothesis of restricted model */
LLRatio = 2*abs(LLF - LLR);           /* test statistic */
DF = 1;
pValue = sdf('ChiSq', LLRatio, DF);   /* Pr(X > LLRatio) if X ~ ChiSq with DF=1 */
print DF LLF LLR LLRatio[F=7.5] pValue[F=PVALUE6.4];

The manual implementation of the likelihood-ratio test gives the same results as running PROC GENMOD twice. The test statistic of 0.87436 has a p-value of about 0.35. Again, the test suggests that you should not reject the null hypothesis. The reduced regression model that has β1=0 is sufficient to explain the data.

The post The likelihood ratio test for linear regression in SAS appeared first on The DO Loop.


The analyst wanted to know how to interpret "the scale parameter" in the model.

This article reviews the MLE process for linear models and compares the output of PROC REG and PROC GENMOD. It shows how to use nonlinear optimization to optimize the log-likelihood function for the linear regression model, thus reproducing the GENMOD output.

PROC GENMOD and PROC REG differ in the way that they estimate the coefficients of a linear regression model. PROC REG uses the method of ordinary least squares (OLS), which is a direct method. In contrast, PROC GENMOD uses maximum likelihood estimation (MLE), which is a general method that can apply to many regression models, not just linear models. As part of the MLE computation, the GENMOD procedure must estimate one parameter (the "scale parameter") that the OLS method does not estimate directly. Rather, OLS obtains the "scale parameter" as a consequence of the least squares process.

Both methods assume that the regression model has the matrix form Y = Xβ + ε, where ε is a vector of independent errors. Often, the errors are assumed to be normally distributed as N(0, σ).

The simplest linear regression model is for one continuous regressor: y = β0 + β1 x + ε. The following SAS DATA step simulates n=50 observations from this model for the parameter values (β0, β1, σ) = (10, 0.5, 1.5).

/* simple linear model y = beta0 + beta1*x + eps for eps ~ N(0,sigma) */
data Sim;
call streaminit(4321);
array beta[0:1] (10, 0.5);   /* beta0=10; beta1=0.5 */
sigma = 1.5;                 /* scale for the distribution of errors */
N = 50;
do i = 1 to N;
   x = i / N;                    /* x is equally spaced in [0,1] */
   eta = beta[0] + beta[1]*x;    /* model */
   y = eta + rand("Normal", 0, sigma);
   output;
end;
run;

The next sections analyze these simulated data twice: first by using OLS, then by using MLE.

The OLS method estimates the regression coefficients by solving the normal equations (X'X)b = X'Y for b. The predicted values are Xb, and the residuals are r = Y - Xb. If you assume that the model is correctly specified and the error terms are normally distributed, you can estimate σ from the distribution of the residuals. PROC REG and other SAS procedures first estimate the sum of squared errors (SSE) as Σ r_{i}^{2}, then estimate σ as the root mean squared error (RMSE), which is sqrt(SSE/(*n-p*)), where *n* is the number of observations and *p* is the number of regression parameters. Thus, I often think of the OLS method as a two-step method: first, estimate the β coefficients, then use those values to estimate σ as the RMSE.
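The two-step OLS procedure described above can be sketched outside SAS in a few lines of Python. For a one-regressor model, the 2×2 normal equations can be solved by Cramer's rule with no linear-algebra library; the data below are a hypothetical noise-free line, not the Sim data from the article.

```python
import math

def ols_two_step(x, y):
    """Two-step OLS for y = b0 + b1*x + eps: solve the 2x2 normal equations
    (X'X) b = X'y for b, then estimate sigma as the RMSE of the residuals."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxx = sum(xi * xi for xi in x)
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * Sxx - Sx * Sx
    b0 = (Sxx * Sy - Sx * Sxy) / det   # Cramer's rule on (X'X) b = X'y
    b1 = (n * Sxy - Sx * Sy) / det
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))    # step 2: sqrt(SSE/(n-p)), p = 2
    return b0, b1, rmse

# hypothetical data on the exact line y = 10 + 0.5*x (no noise)
xs = [0.1 * i for i in range(1, 11)]
ys = [10 + 0.5 * xi for xi in xs]
b0, b1, rmse = ols_two_step(xs, ys)    # recovers b0=10, b1=0.5, rmse near 0
```

For real data with noise, a library routine (for example, `numpy.linalg.lstsq`) is the practical choice; this sketch only makes the two steps explicit.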

proc reg data=Sim plots(only)=fit;
   model y = x;
run;

The call to PROC REG produces a fit plot that enables you to visualize the fit. The vertical distances from the markers to the regression line are the (absolute) residuals. The procedure outputs the parameter estimates and the RMSE statistic, which estimates σ:

I have highlighted a few rows of output. Recall that the data are a random sample of size n=50 from a model whose parameters are (β0, β1, σ) = (10, 0.5, 1.5). The parameter estimates are (9.7, 0.77, 1.7), which are close to the parameter values. The next section uses PROC GENMOD to obtain the MLE estimates for the same parameters.

In contrast to the "two step" OLS method, the MLE method estimates σ and the β coefficients simultaneously. Look at the output from the following call to PROC GENMOD:

proc genmod data=Sim plots=none;
   model y = x;
run;

Notice that the output from PROC GENMOD includes the estimate for the scale of the error distribution as part of the parameter estimates table. The output also includes a NOTE that reminds you that the scale parameter was estimated as part of the solution. In this table, the estimates for the regression coefficients ("the betas") are the same as for PROC REG (to four decimals), but the estimate of the scale parameter is smaller. This will always be the case. Recall that the least squares method estimates σ as sqrt(SSE/(*n-p*)), where *n* is the number of observations and *p* is the number of regression parameters. In contrast, the MLE method uses sqrt(SSE/*n*), which is the "population" formula for the standard deviation of the residuals. (Recall that the *population standard deviation* has *n* in the denominator whereas the *sample standard deviation* uses *n*-1.) Thus, the MLE method uses a larger denominator, which makes the estimate smaller than the OLS estimate.
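The two denominators can be compared directly. A minimal sketch with hypothetical values (n = 50 observations, p = 2 regression parameters, and an arbitrary SSE, not the GENMOD output):

```python
import math

n, p, sse = 50, 2, 141.8                # hypothetical values for illustration
sigma_ols = math.sqrt(sse / (n - p))    # REG-style RMSE: denominator n - p
sigma_mle = math.sqrt(sse / n)          # GENMOD-style MLE scale: denominator n
ratio = sigma_mle / sigma_ols           # equals sqrt((n - p)/n), always < 1
```

The ratio depends only on n and p, so for a fixed model the MLE scale is always the same fixed fraction of the OLS estimate.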

PROC GENMOD also outputs a table that shows some goodness-of-fit statistics:

I've highlighted the value of the log-likelihood function evaluated at the parameter estimates. The next section shows how to construct the log-likelihood function for a linear regression model.

You can find online descriptions of the maximum likelihood equations for linear regression. For the linear model, you can explicitly solve for the parameter values that maximize the log-likelihood function. The MLE estimates for the regression coefficients are the same as for the least squares method; the estimate for the scale parameter is always the (population) standard deviation of the residuals. Consequently, you can obtain the MLE estimates "directly" by using the least squares solution. However, for the sake of demonstrating how the MLE estimates are computed in PROC GENMOD, this section constructs the log-likelihood function for the linear regression model and performs nonlinear optimization to find the parameter values that maximize the log likelihood.

The log-likelihood function is a function of the coefficients, β, and of the unknown parameter, σ, which is the scale parameter for the "noise" or the magnitude of the error term. It requires the data values to evaluate the function. If you assume normally distributed errors, the MLE equations are (see the StatLect notes):

\(
\mathit{ll}(\beta, \sigma; y, x) = -\frac{1}{2} \sum_i \left[ \frac{(y_i - x_i \beta)^2}{\sigma^2} + \log( \sigma^2 ) + \log (2 \pi) \right]
\)

If you define \(r_i = y_i - x_i \beta\), then the equation becomes

\(
\mathit{ll}(\beta, \sigma; y, x) = -\frac{1}{2} \left[ n \log( \sigma^2 ) + n \log (2 \pi) + \sum_i \frac{r_i^2}{\sigma^2} \right]
\)

Because the scale parameter appears as σ^{2}, sometimes the variance is used as the scale parameter in the MLE equations. However, PROC GENMOD uses σ, and I will do the same.

It is instructive to "manually" maximize the likelihood function, given the data. You can type in the previous formula, but the simpler way is to sum the SAS-supplied LOGPDF function for the normal distribution, as discussed in a previous article.

proc iml;
use Sim;                    /* read the data */
read all var {"x" "y"};  close;
X = j(nrow(x),1,1) || x;    * design matrix;

start LogLikReg(parm) global(X, y);
   pi = constant('pi');
   b = parm[1:2];
   sigma = parm[3];
   sigma2 = sigma##2;
   r = y - X*b;
   /* you can use the explicit formula:
         n = nrow(X);
         LL = -1/2*(n*log(2*pi) + n*log(sigma2) + ssq(r)/sigma2);
      but a simpler expression uses the sum of the log-PDF */
   LL = sum( logpdf("normal", r, 0, sigma) );
   return LL;
finish;
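The equivalence between the explicit formula and the sum of log-PDF values is easy to verify outside SAS. The following Python sketch evaluates the log likelihood both ways; the standard library has no `logpdf`, so the N(0, σ) log-density is written out by hand, and the data and parameter values are hypothetical.

```python
import math

def loglik_explicit(b0, b1, sigma, x, y):
    """LL = -1/2 * [ n*log(sigma^2) + n*log(2*pi) + sum(r_i^2)/sigma^2 ]"""
    n = len(x)
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * (n * math.log(sigma ** 2) + n * math.log(2 * math.pi)
                   + ssr / sigma ** 2)

def normal_logpdf(r, sigma):
    """Log of the N(0, sigma) density at r (what SAS's LOGPDF computes)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - r * r / (2 * sigma ** 2))

def loglik_sum(b0, b1, sigma, x, y):
    """The same LL as a sum of per-observation log-densities of residuals."""
    return sum(normal_logpdf(yi - b0 - b1 * xi, sigma) for xi, yi in zip(x, y))

# hypothetical data and parameter values
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [10.3, 9.9, 10.6, 10.2, 10.8]
ll1 = loglik_explicit(10, 0.5, 1.5, xs, ys)
ll2 = loglik_sum(10, 0.5, 1.5, xs, ys)   # agrees with ll1 to rounding error
```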

If you draw a heat map of the log-likelihood function for σ=1.5, you get the following picture. From the heat map, you can guess that the log-likelihood function achieves a maximum somewhere close to the center of the graph. For reference, I have overlaid lines at the parameter values (β0, β1) = (10, 0.5).

The graph suggests reasonable values for an initial guess for a nonlinear optimization. The following call to the NLPNRA subroutine in SAS IML performs the maximum likelihood optimization:

/* set constraint matrix, options, and initial guess for optimization */
/*     b0  b1  sigma constraint matrix */
con = { .   .   1E-6,   /* lower bounds: none for beta[i]; 0 < sigma */
        .   .   .};     /* upper bounds: none */
opt = {1,               /* find maximum of function */
       0};              /* do not print during optimization */
param0 = {11 1 1.2};    /* initial guess */
call nlpnra(rc, MLEest, "LogLikReg", param0, opt, con);
/* what is the LL at the optimal parameter? */
LLopt = LogLikReg(MLEest);
print LLopt[F=8.4], MLEest[c={'b0' 'b1' 'RMSE'} F=7.5];

The results of the nonlinear optimization are the same parameter estimates that were obtained by using PROC GENMOD. In addition, the log-likelihood function evaluated at the optimal values is the same as the "full likelihood" value that is reported by PROC GENMOD.

This article uses nonlinear optimization in PROC IML to reproduce the results of the maximum likelihood estimates for a PROC GENMOD regression model. The article shows how to define the log-likelihood function that is maximized. It also shows that the estimate for the magnitude of the error term differs between the MLE and OLS methods, and that the MLE estimate is smaller. This is the reason that PROC GENMOD and PROC REG obtain different estimates for the same data and model. It also explains why the Parameter Estimates table for PROC GENMOD contains an extra row for the Scale parameter, and why PROC GENMOD displays a note that states, "The scale parameter was estimated by maximum likelihood."

The post Maximum likelihood estimates for linear regression appeared first on The DO Loop.


Pi is a mathematical constant defined as the ratio of a circle's circumference (C) to its diameter (D). In symbols, π = C / D. In terms of the circle's radius, the definition is π = C / (2r). Circles are fundamental objects in mathematics, so mathematicians naturally are interested in properties of circles such as their area. As you will recall from grade school, the area of a circle involves π.

Ancient mathematicians did not have accurate approximations to π or even the formula for the area of a circle. That changed when the Greek mathematician Archimedes studied these problems circa 250 B.C. Two of his pi-related results were:

- Archimedes proposed a method to approximate π by using inscribed and circumscribed regular polygons with a large number of sides.
- Archimedes derived a formula for the area of a circle. The derivation involves slicing a circle along diameters and rearranging the slices to approximate a rectangle.

Archimedes solved both problems by using the concept of a mathematical limit centuries before calculus was formalized! His derivation of the formula for a circle's area is fascinating. Modern descriptions often use a large round pizza as a visual aid. As the pizza is cut into more and more slices, the slices can be arranged to form a figure that looks more and more like a rectangle, as shown in the figure to the right. You can find many videos that use animation to describe the method. In this article, I write a SAS program to visualize the "pizza pi" method.

To find the formula for the area of a circle, Archimedes reasoned as follows. Suppose you cut a circle along a diameter to create N equal sectors.
You can then rearrange the sectors linearly, alternating the points up and down, as shown in the previous pizza image. The resulting shape looks a bit like a rectangle.
The height of the almost-rectangle is *r*, the radius of the circle. As you can see in the pizza image, half the crust forms the rectangle's top and half forms its bottom.
Therefore, the base of the almost-rectangle has length C/2,
where C is the circumference of the circle (the crust of the pizza).

This is illustrated by the following graphs. The first shows a circle (or pizza) cut into N=8 equal slices.

You can arrange the slices in a row, alternating the orientation so that half the slices point up and half point down.

Because the left and right edges are slanted, the row of slices resembles a parallelogram more than a rectangle. However, you can make an additional cut to address that concern. Cut the first slice in half, and move one half to the right side of the figure. With that change, the figure has vertical sides and resembles a rectangle that has scalloped edges on the top and bottom, as shown in the following image. This is the mathematical version of the pizza image that was shown previously.

Whereas N=8 slices provides a crude approximation to a rectangle, using N=16 slices provides a better approximation, as shown below:

Although the mathematical formalization of limits was still 1,800 years in the future,
Archimedes recognized that as the number of slices becomes larger and larger (a technique known as the method of exhaustion, for obvious reasons!),
the height of the approximating rectangle approaches *r* and the base approaches C/2.
The area of the rectangle is therefore A = base × height = (C/2) r.
Recall that π is defined as π = C / (2r).
Because the area of the rectangle equals (in the limit) the area of the circle, Archimedes substituted πr for C/2 and—Eureka!—the area of the circle is
(π*r)*r = π r^{2}.

The method of exhaustion provides a formula for the area of a circle.
However, if you want to approximate π numerically, it is better to use Archimedes's method of
inscribed and circumscribed regular polygons.
Archimedes formally proved that π is in the range (3 + 10/71, 3+10/70)
by using 96-sided regular polygons.
Notice that the right-hand side of the interval is the familiar quantity 22/7, which is an excellent approximation to π by a low-denominator fraction.
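Archimedes's bounds are easy to reproduce numerically: an inscribed regular n-gon has perimeter 2nr·sin(π/n) and a circumscribed one has perimeter 2nr·tan(π/n), so dividing by the diameter 2r brackets π. The sketch below cheats by using `math.pi` to evaluate the trigonometric functions (Archimedes instead used a recursive side-doubling scheme starting from a hexagon), but it confirms that the 96-gon bounds lie inside his interval.

```python
import math

def pi_bounds(n):
    """Bounds for pi from regular n-gons: perimeter/diameter is
    n*sin(pi/n) for the inscribed polygon (a lower bound) and
    n*tan(pi/n) for the circumscribed polygon (an upper bound)."""
    return n * math.sin(math.pi / n), n * math.tan(math.pi / n)

lo, hi = pi_bounds(96)   # Archimedes's 96-gon
# 3 + 10/71  <  lo  <  pi  <  hi  <  22/7
```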

As mentioned previously, you can find videos in which mathematicians literally slice a pizza and rearrange the slices. But as you can imagine, applying the method of exhaustion to a physical pizza is, well, exhausting! Consequently, I wrote a SAS program to illustrate the process.

Not only is running a program more accurate and less messy than cutting a pizza, the program enables you to slice the circle into an arbitrary number of slices. If you use a real pizza, you are almost surely going to slice the pizza into 4, 8, 16, 32, ..., slices. These powers of two occur because of the mechanical limitations of subdividing a real physical object. However, the program has no trouble slicing a circle into 17, 23, or 56 equal slices, a feat that is impossible by using kitchen scissors. For example, here is the picture for 56 slices:

As expected, the image with this many slices looks very much like a rectangle. If you were to zoom in on the upper and lower edges of the rectangle, you would see that the edges are still scalloped, but as the slices get thinner and thinner, the edge of each slice gets flatter and flatter. Consequently, the scalloping diminishes as the number of slices increases.
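One way to quantify how fast the scalloping diminishes (a detail not computed in the post): each slice subtends an angle 2π/N, and the maximum deviation of its arc from its chord is the sagitta r(1 − cos(π/N)), which shrinks roughly like 1/N².

```python
import math

def scallop_depth(N, r=1.0):
    """Maximum deviation of one slice's arc from its chord (the sagitta).
    Each slice subtends angle 2*pi/N, so the half-angle is pi/N and the
    sagitta is r*(1 - cos(pi/N))."""
    return r * (1.0 - math.cos(math.pi / N))

d8, d56 = scallop_depth(8), scallop_depth(56)
# d56 is roughly (8/56)^2 = 1/49 of d8: the scalloping shrinks like 1/N^2
```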

Archimedes used the "method of exhaustion" to derive the area of a circle by cutting it along its diameter and rearranging the equally sized sectors into a shape that resembles a rectangle. Although pizza would not be invented for another two millennia, it is fun to visualize Archimedes's method by cutting and rearranging slices of a pizza "pi". Since I am trying to reduce carbs, I chose to write a SAS program that demonstrates Archimedes's method. The program has an advantage over slicing a pizza in that you can slice the virtual circles into an arbitrary number of equally sized slices. The disadvantage is that you do not get to eat the demonstration after it is finished.

The post Pizza pi appeared first on The DO Loop.


If you know how to write the words for each natural number in a language, you can play the Number-Word Game:

- Start with any natural number.
- Write down the word(s) for the integer in the chosen language.
- Count the number of characters in the word. This gives a new natural number.
- Go to (2). Repeat until a portion of the sequence repeats itself, at which point the game ends.

In the previous article, I wrote a SAS program that plays the Number-Word Game in English. I showed that every sequence of integers terminates at the number 4, which is a fixed point for the English game. However, if you use other languages, you can get multiple fixed points or periodic cycles.
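The game loop itself is language-agnostic. Here is a minimal Python sketch (not the article's SAS macro): `words` maps each integer to its name in some language, and the loop stops at a fixed point or a period-two cycle, just as the macro below does. The dictionary covers only the English words 1-10, which is enough for the example.

```python
def play(words, n0, max_iter=20):
    """Play the Number-Word Game starting at n0. `words` maps integers to
    their names; returns the sequence of integers, ending when the next
    value equals the current value (fixed point) or the previous value
    (period-two cycle)."""
    seq = [n0]
    for _ in range(max_iter):
        nxt = len(words[seq[-1]])     # count the characters in the name
        seq.append(nxt)
        if nxt == seq[-2] or (len(seq) >= 3 and nxt == seq[-3]):
            break                     # fixed point or period-two cycle
    return seq

english = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
           6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}
path = play(english, 7)   # 7 -> 5 -> 4, then 4 repeats
```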

This article creates a SAS macro that enables you to play the
Number-Word game in any alphabetic language. I demonstrate the program for Spanish and show that some sequences converge to a fixed point whereas others converge to a period-two cycle.
Then, I play the game in Klingon, a fictional language from the *Star Trek* universe. (Qa'Pla. NuqneH, nuqDaq 'oH puchpa''e'?, which means "Welcome. Where are you from, fellow Klingon?")

Let's start by writing a SAS program to play the Number-Word Game. The following macro contains a DATA step and a PROC PRINT statement. It is based on the program in the previous article, but it has a few differences:

- The macro takes two arguments.
- The first argument is the name of a SAS data set. I suggest you name the data set after the language: Spanish, French, Klingon, etc. The data set must contain *k* character variables with the names W1-W*k*. The value of W*i* is the character representation of the number *i* in the selected language. For example, in English, the variables W1, W2, and W3 contain the values "one", "two" and "three", respectively. In Spanish, the variables W1, W2, and W3 contain the values "uno", "dos" and "tres".
- The second argument is a natural number (less than or equal to *k*) to use as the initial number in the game.
- The program uses the KLENGTH function, which is the preferred function for counting the number of characters in a non-English string.
- The program stops iterating if it encounters a fixed point or a period-two cycle. You could augment the program to detect cycles of higher periods.
- The program can only analyze the numbers 1–*k*. My examples use *k* = 30.

%macro NumberWordGame(Language, N0);
options nonotes;
data IterHistory;
keep Iter Num Word Length;
set &Language;
array W[*] W:;            /* W1-W&MaxN */
length Word $100;
Iter = 0;
PrevNum = .;
Num = &N0;
if Num > dim(W) then do;
   put "ERROR: Invalid input: " Num;
   STOP;
end;
Word = W[Num];
Length = klength(Word);
output;
stopCond = (Length=Num | Length=PrevNum);   /* stop if reach fixed point or period-two cycle */
do Iter=1 to 20 while (^stopCond);
   PrevNum = Num;
   Num = Length;
   if Num > dim(W) then do;
      put "ERROR: Invalid value: " Num;
      STOP;
   end;
   Word = W[Num];
   Length = klength(Word);   /* will become the next Num */
   stopCond = (Length=Num | Length=PrevNum);
   output;
end;
run;

title "The Number-Word Game for &Language: Start from &N0";
proc print data=IterHistory noobs;
   var Iter Num Word Length;
run;
options notes;
%MEND;

The next section shows how to define the input data set for the Spanish language.

The following SAS DATA step defines a data set that has 30 character variables named W1-W30. The contents of the data set are the Spanish words "uno", "dos", "tres", ..., "veintinueve", and "treinta". You can type these values directly on the ARRAY statement, or read the values from data lines, as follows:

/* Read the Spanish numbers 1-30 into arrays */
%let MaxN = 30;
data Spanish;
length Word $100;
array W[&MaxN] $100;
do i = 1 to &MaxN;
   input Number Word 5-50;
   W[i] = Word;
end;
drop i Number Word;
datalines;
1   uno
2   dos
3   tres
4   cuatro
5   cinco
6   seis
7   siete
8   ocho
9   nueve
10  diez
11  once
12  doce
13  trece
14  catorce
15  quince
16  dieciseis
17  diecisiete
18  dieciocho
19  diecinueve
20  veinte
21  veintiuno
22  veintidos
23  veintitres
24  veinticuatro
25  veinticinco
26  veintiseis
27  veintisiete
28  veintiocho
29  veintinueve
30  treinta
;

The name of the data set is "Spanish." You can specify the data set name and an initial number (less than or equal to 30) to play the Number-Word Game in Spanish, as follows:

%NumberWordGame(Spanish, 19);   /* period 2 */

When the initial number is 19, the output shows that the generated sequence is 19 → 10 → 4 → 6. The next number would be 4 ("seis" has four letters), so the algorithm stops because it detects that the sequence will repeat {4, 6, 4, 6, ...} forever.

Let's try a different number, 20:

%NumberWordGame(Spanish, 20);   /* period 2 */

The output is similar. Again, the sequence is attracted to a period-two cycle: 20 → 6 → 4, after which the sequence {6, 4, 6, 4, ...} will repeat forever. Does every number converge to the period-two cycle? No. The Spanish word for 5 is "cinco", which has five letters, therefore 5 is a fixed point. The number 21 is an example that converges to the fixed point at 5:

%NumberWordGame(Spanish, 21);   /* fixed point */

For the Spanish language, every initial value converges either to 5 or to the period-two cycle {4, 6}.
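This claim can be checked by brute force. The following Python sketch (not part of the article's SAS code) iterates every starting value 1–30 until a value repeats, then records the cycle it entered; the word list is transcribed from the DATA step above, with accents omitted as in the article.

```python
spanish = ("uno dos tres cuatro cinco seis siete ocho nueve diez once doce "
           "trece catorce quince dieciseis diecisiete dieciocho diecinueve "
           "veinte veintiuno veintidos veintitres veinticuatro veinticinco "
           "veintiseis veintisiete veintiocho veintinueve treinta").split()
words = {i + 1: w for i, w in enumerate(spanish)}

def attractor(n):
    """Iterate n -> len(word for n) until a value repeats;
    return the set of values in the cycle that was entered."""
    seen = []
    while n not in seen:
        seen.append(n)
        n = len(words[n])
    return frozenset(seen[seen.index(n):])

results = {attractor(n) for n in range(1, 31)}
# every start lands on the fixed point {5} or the period-two cycle {4, 6}
```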

You can apply the algorithm to any alphabetic language, real or fictional. To demonstrate, let's consider the fictional language of Klingon,
which was developed for the *Star Trek* movies and television series. The Klingon language was created by Marc Okrand, a professor of linguistics at the University of California.
You can read about how to count in Klingon, or just run the following SAS DATA step:

/* Read the Klingon numbers 1-30 into an array */
%let MaxN = 30;
data Klingon;
length Word $100;
array W[&MaxN] $100;
do i = 1 to &MaxN;
   input Number Word 5-50;
   W[i] = Word;
end;
drop i Number Word;
datalines;
1   wa'
2   cha'
3   wej
4   loS
5   vagh
6   jav
7   Soch
8   chorgh
9   Hut
10  wa'maH
11  wa'maH wa'
12  wa'maH cha'
13  wa'maH wej
14  wa'maH loS
15  wa'maH vagh
16  wa'maH jav
17  wa'maH Soch
18  wa'maH chorgh
19  wa'maH Hut
20  cha'maH
21  cha'maH wa'
22  cha'maH cha'
23  cha'maH wej
24  cha'maH loS
25  cha'maH vagh
26  cha'maH jav
27  cha'maH Soch
28  cha'maH chorgh
29  cha'maH Hut
30  wejmaH
;

/* all iterations converge to 3 ("wej") */
%NumberWordGame(Klingon, 20);
%NumberWordGame(Klingon, 25);

The output shows the Klingon version of the Number-Word Game for two input values. Both converge to 3 ("wej") after a few iterations. You can play the game for the values 1–30 to convince yourself that all input values converge to 3 for the Klingon language. This result is very appropriate if you are familiar with the Klingon culture: there is a fixed point that is dominant; all other numbers follow a path that leads to the dominant fixed point!
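Rather than running the macro 30 times, you can verify the claim in one pass. The sketch below rebuilds the Klingon word list from the DATA step above; the strings are plain ASCII, so Python's `len` plays the role of SAS's KLENGTH here.

```python
# Klingon digit words 1-9, transcribed from the DATA step above
units = ["wa'", "cha'", "wej", "loS", "vagh", "jav", "Soch", "chorgh", "Hut"]
words = {i + 1: w for i, w in enumerate(units)}
words[10] = "wa'maH"
words[20] = "cha'maH"
words[30] = "wejmaH"
for i in range(1, 10):
    words[10 + i] = "wa'maH " + units[i - 1]    # 11-19: wa'maH wa', ...
    words[20 + i] = "cha'maH " + units[i - 1]   # 21-29: cha'maH wa', ...

def final_value(n):
    """Iterate n -> character count of its name until the value stops changing."""
    while len(words[n]) != n:
        n = len(words[n])
    return n

all_wej = all(final_value(n) == 3 for n in range(1, 31))
# every starting value 1-30 reaches the fixed point 3 ("wej")
```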

The main limitation in this implementation is that you must create a data set that associates numbers and words for each language. When I demonstrated the English version of the Number-Word Game, I used the WORDS*w*. format in SAS to automatically generate the words from the numbers. Alas, SAS does not provide a format that converts numbers to Klingon, so you must manually input the word for each number.

By running the program for a variety of input values, you can claim that the algorithm converges to a fixed point or a limit cycle.
The correctness of this claim assumes that the language has the following property:
There exists a natural number, G, such that for all natural numbers *n* > G, the character representation of *n* has fewer than *n* characters.
This forces the sequence to be strictly decreasing for *n* > G. You can then manually check the behavior of the sequence for the finite values 1–G
to discover the fixed points and periodic cycles.
In English, G = 5. In Spanish, G = 5.
In Klingon, G = 3.

It's your turn! Want to play the Number-Word Game in your favorite language? Do the following:

- Create a SAS data set that contains the variables W1–W30, which contain the character representations of the numbers 1–30. Use the **Spanish** data set as an example.
- Run the %NumberWordGame macro for one input value to make sure it works. For example, run **%NumberWordGame(Spanish, 20);** Does the sequence terminate with a fixed point or a periodic cycle?
- Run the %NumberWordGame macro for all inputs 1–30 to determine all possible behaviors. To help, you can run the following macro, which plays the game for a sequence of inputs:

%macro PlayGames(Language, maxN);
%do i = 1 %to &maxN;
   %NumberWordGame(&Language, &i);
%end;
%mend;

%PlayGames(Spanish, 10);   /* Play the Spanish game for inputs 1-10 */

- Post a comment and let me know the fixed points and/or periodic cycles for your language. Does any language have a periodic cycle of length 3 or 4? Use the following template to report your results. Replace the boldface words to match your language. Note: Your language might have more than one fixed point and/or more than one periodic cycle! Or it might have only a fixed point or only a periodic cycle.

*I ran the Number-Word Game for* **Spanish**.
*A fixed point is* **5** ("**cinco**").
*A periodic cycle is* **4** ("**cuatro**") *and* **6** ("**seis**").

The post A generalized Number-Word Game appeared first on The DO Loop.

The post The Number-Word Game appeared first on The DO Loop.

- Start with any positive integer.
- Write down the English word for the integer.
- Count the number of letters in the word. This gives a new positive integer.
- Go to (2). Repeat until a portion of the sequence repeats itself.

Here is an example of playing the Number-Word Game:

- Start with **7**.
- The English word for 7 is **seven**, which has five letters. Therefore, the next integer is **5**.
- The English word for 5 is **five**, which has four letters. Therefore, the next integer is **4**.
- The English word for 4 is **four**, which has four letters. Therefore, the next integer is **4**. The sequence is repeating itself (4, 4, 4, ...), so the game stops.

Here's an interesting fact: For the English version of the Number-Word Game, the sequence ALWAYS reaches 4! In the terminology of discrete dynamical systems,
the number 4 is a globally attracting *fixed point* for the iterative algorithm. No matter what initial natural number you choose, the game always ends with 4.

This game makes a fun programming exercise in SAS because SAS contains a special format (the WORDS*w*. format) that can automatically convert integers to the equivalent English word.
This article shows how to use the WORDS. format to play the Number-Word Game.

The WORDS. format in SAS converts numbers to words. By using the format, you do not have to type the English words for the natural numbers! Instead, you can use the PUTN function to convert an integer into the equivalent English word. You can then use the LENGTH function to find the length of the word. To demonstrate these functions, the following SAS DATA step generates the natural numbers 1–20, the corresponding English word, and the length of the English word:

data NumberWords;
length Word $200;
do Number = 1 to 20;
   Word = putn(Number, "WORDS200.");
   Length = length(Word);
   output;
end;
label Length = "Number of Letters for Number";
run;

proc print data=NumberWords label noobs;
   var Number Word Length;
run;

The output is shown at the top of this article.
You can use the table to play the Number-Word Game for any integer up to 20. For example, suppose you choose **20** as an initial number.
According to the table, the word **twenty** has **6** letters, so go to row 6.
The word **six** has **3** letters, so go to row 3. The remaining sequence is 3 → 5 → 4,
which ends the game.

You can use the table to convince yourself that every starting number in the table eventually reaches 4, and the game ends.

You can program the Number-Word Game if you know the number of letters in the word that represents each natural number. The following SAS macro uses a DATA step to compute the iteration history of the game when you specify an initial number.

/* English version of the Number-Word Game.
   You specify an initial natural number. The program shows the iteration
   of the Number-Word Game, which always reaches the number 4. */
%macro NumberWordIteration(N0);
data IterHistory;
Iter = 0;
N = &N0;
Word = putn(N, "WORDS200.");
Length = length(Word);
output;
do Iter=1 to 100 while (N ^= Length);   /* stop if reach fixed point */
   N = length( Word );
   Word = putn(N, "WORDS200.");
   Length = length(Word);
   output;
end;
run;

proc print data=IterHistory noobs;
run;
%MEND;

With the help of the program, we can play the game for very large numbers. For example, the following statement plays the game for **673**:

%NumberWordIteration(673);

Are you surprised that the game terminates after only six steps? Such rapid convergence is fairly typical. The word for 673 has only 25 characters (including spaces and hyphens), so the next number is much smaller than the first number. The word twenty-five has only 11 characters, and we've previously noted that every natural number less than 20 terminates at 4.

Let's try it again with an even larger initial number:

%NumberWordIteration(12345);

For the number 12,345, the game terminates in only three iterations! The sequence is 40 → 5 → 4.

You can prove that the English version of the Number-Word Game terminates at the number 4 for every starting number. First, prove that the number of letters for the number N is less than N for all N ≥ 5. This implies that the sequence in the game is a strictly decreasing sequence until the sequence becomes 4 or less. In the second step, manually check the result of the game for 1, 2, 3, and 4. That completes the proof.

The following graph shows the number of letters in the English word for each natural number 1–100. The diagonal line is the identity line y=x. All markers for numbers greater than 4 are underneath the identity line. This shows that if N is greater than 4, the number of letters in the word for N is less than N.
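You can reproduce the data behind this graph without SAS. The following Python sketch (a hypothetical `words` helper that spells numbers in the style of the WORDS*w*. format, counting spaces and hyphens) confirms that the word for N is shorter than N for every N from 5 to 100:

```python
# spell out 1-100 in English and compare each word length to the number
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def words(n):
    """English words for 1 <= n <= 100, including spaces and hyphens."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    return "one hundred"

# every marker for n >= 5 lies below the identity line y = x
assert all(len(words(n)) < n for n in range(5, 101))
```

Note that 4 is the only number whose word length equals the number itself, which is exactly why it is the fixed point of the game.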

On my Windows PC, it appears that the WORDS*w*. format can produce the words for numbers up to 999,999,999. Numbers that are one billion or larger are formatted as "large_number."

For numbers less than one billion, I think the number that has the longest character representation is 777,777,777, which requires 100 characters (including spaces) to represent. The iteration history for that number requires only six iterations before reaching the fixed point: 100 → 11 → 6 → 3 → 5 → 4.

The Number-Word Game is an algorithm that can be easily programmed in SAS by using the WORDS*w*. format and the LENGTH function.
This article shows how to write a simple SAS program that plays the Number-Word Game. For English words, the Number-Word Game always terminates after a small number of steps. The number 4 is a global attracting fixed point for the algorithm.

The post The Number-Word Game appeared first on The DO Loop.

]]>The post Using colors to visualize groups in a bar chart in SAS appeared first on The DO Loop.

]]>I sometimes see analysts overuse colors in statistical graphics. My rule of thumb is that you do not need to use color to represent a variable that is already represented in a graph. For example, it is redundant to use a continuous color ramp to represent the lengths of bars in a bar chart. The lengths already indicate the value, so the colors do not add additional information.

However, I sometimes bend the rule when the color represents cutoff values (or binning values) that divide the bar lengths into groups. A canonical example comes from elementary school where teachers assign a letter grade (A, B, C, D, or F) to a student's test score based on some cutoff values. The example in this article uses a 10-point scale to assign letter grades. That is, 90% or above is an "A", 80%-89.9% is a "B", and so forth. Grades below 60% receive a failing grade, "F".
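For comparison, here is the same 10-point binning rule expressed outside of SAS. This Python sketch (the helper name is hypothetical) maps a decimal score to a letter grade by searching the list of cutoff values:

```python
import bisect

def letter_grade(score):
    """Assign a letter grade on a 10-point scale; scores are in [0, 1]."""
    cuts = [0.60, 0.70, 0.80, 0.90]          # left endpoints for D, C, B, A
    return "FDCBA"[bisect.bisect_right(cuts, score)]

print(letter_grade(0.86))   # B
```

Using `bisect_right` means that a score exactly at a cutoff (such as 0.60) falls into the higher grade, which matches the half-open intervals used by the SAS format later in this article.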

The bar chart to the right shows test scores for 14 students. The bars are colored according to the corresponding letter grades. This article shows how to create a bar chart like this in SAS by using PROC SGPLOT. Along the way, we'll discuss the following data visualization topics:

- How to create and use a user-defined format that bins a continuous variable into discrete ordinal groups.
- How to create a DATA step view to avoid creating a new data set that contains the formatted values.
- How to create a color ramp that contains perceptually balanced colors.
- How to create a discrete attribute map that assigns colors to group levels.

You can use the HBAR or VBAR statement in PROC SGPLOT to create a bar chart that shows the scores of a set of students. The following program defines some example test scores and displays a bar chart. I'll point out a few data visualization techniques in this example:

- I represent the scores as decimal values (0 ≤ score ≤ 1) so that I can use the PERCENT*w.d* format to display the scores as percentages.
- I prefer to use horizontal bar charts for this task.
- I use the CATEGORYORDER= option on the HBAR statement to sort the bars in the bar chart.
- For now, I use the REFLINE statement to overlay the cut points for the letter grades. In the next section, I remove the reference lines and color the bars instead.

data Tests;
   format Name $15. score percent6.4;
   input Name score;
   datalines;
Abbott 0.75
Beth   0.59
Carol  0.90
Derek  0.61
Ed     0.93
Felix  0.70
Garry  0.71
Harry  0.80
Izzy   0.33
Jacob  0.20
Ken    0.60
Lenora 0.69
Mike   0.99
Nancy  0.86
;

title "Bar Chart of Test Grades";
title2 "No Bar Colors";
proc sgplot data=Tests;
   hbar Name / response=score categoryorder=respdesc;  /* or respasc */
   refline 0.6 0.7 0.8 0.9 / axis=x;
   xaxis max=1;    /* make sure 100% is max value */
run;

Because the chart is sorted by bar length, you can easily discern which students performed well on the test and which students are struggling. However, some people might prefer to assign colors to the bars to represent the letter grades.

You can define a custom format that associates a letter to each test score. You can then use the GROUP= option on the HBAR statement to color the bars that share the same letter grade. The following call to PROC FORMAT defines the range for each letter grade by using a 10-point scale. You can create a DATA step view to create a new variable to use for the grouping variable, as follows:

/* use format to bin scores into letter grades
   https://blogs.sas.com/content/iml/2019/07/15/create-discrete-heat-map-sgplot.html */
proc format;
value GradeFmt
      0   -< .60  = "F"    /* [ 0%, 60%)  */
      .60 -< .70  = "D"    /* [60%, 70%)  */
      .70 -< .80  = "C"    /* [70%, 80%)  */
      .80 -< .90  = "B"    /* [80%, 90%)  */
      .90 -  1.00 = "A";   /* [90%, 100%] */
run;

/* define a DATA step VIEW that creates 'Grade' as a formatted version of the score */
data Tests2 / view=Tests2;
   set Tests;
   Grade = score;          /* <== new variable */
   format Grade GradeFmt.;
run;

title "Bar Chart of Test Grades";
title2 "Bar Colors";
proc sgplot data=Tests2;
   hbar Name / response=score categoryorder=respdesc group=Grade;
   xaxis label="Score" grid max=1 values=(.1 to .9 by .1) valueshint;  /* make sure 100% is max value */
run;

The Grade variable is computed in the DATA step view by applying the GRADEFMT. format to a copy of the Score variable. I make a copy because I want to use the raw score variable for the length of the bars.

At this point, the task is complete. Each bar has a length that indicates the student's test score. Each bar has a color that indicates the letter grade. However, notice that the colors are assigned automatically by cycling through the colors in the current ODS style. The colors do not convey any meaning. If you want to specify a meaningful color for each letter grade, then read on.

Sometimes programmers use a "traffic light" color scheme to encode ordinal categories such as letter grades. In a traffic light encoding, low values are assigned a red color, moderate values are assigned orange or yellow, and high values are assigned green.

SAS enables you to specify colors by using several methods. For example, you can specify colors by using pre-defined color names such as Red, Orange, Yellow, Light Green, and Green. However, recall that the human eye perceives colors in complex ways. Some colors appear brighter—and therefore more important—than other colors. To reduce the bias caused by color perception, I recommend using colors from palettes that have been carefully designed so that no one color dominates the others. One way to do this is to use a palette from the ColorBrewer system of palettes. You can use the PALETTE function in SAS IML to generate color palettes from the ColorBrewer system. For example, the following statements create a five-color palette from the "RdYlGn" family:

/* Use a ColorBrewer 5-color diverging palette */
proc iml;
Palette = palette("RdYlGn", 5);
print Palette[c={'Red' 'Orange' 'Yellow' 'Light Green' 'Green'}];
quit;

I have previously written about how to create a discrete attribute map that assigns colors to groups. In this case, I want to use cutpoints and a SAS format to bin a continuous variable into letter grades. I like to specify the cutpoints and then apply the SAS format to obtain the formatted values. The following DATA step specifies the cutpoints for the left-hand side of the intervals that determine letter grades from test scores. The colors are specified by using the hexadecimal ColorBrewer values from the previous section, but you could specify color names (for example, "Red") if you prefer. The result is a data set that defines a mapping from formatted values ("A", "B", ..., "F") to colors.

To use the attribute map, specify the DATTRMAP option on the PROC SGPLOT statement. On the HBAR statement, specify the name of the ID variable that associates values and fill colors. The result is a bar chart in which the colors are red, orange, yellow, and greens.

/* Create a discrete attribute map that assigns a color to a grade range */
/* https://blogs.sas.com/content/iml/2012/10/17/specify-the-colors-of-groups-in-sas-statistical-graphics.html */
data GradeAttrs;                    /* create discrete attribute map */
length Value $11 FillColor $15;
retain ID 'GradeColors'             /* name of map */
       Show 'AttrMap';              /* always show all groups in legend */
array cutpts{5} _temporary_ (0.0 0.6 0.7 0.8 0.9);
/* ('Red' 'Orange' 'Yellow' 'Light Green' 'Green') */
array colors{5} $15 _temporary_ ("CXD7191C" "CXFDAE61" "CXFFFFBF" "CXA6D96A" "CX1A9641");
do i = 1 to dim(cutpts);
   Value = put(cutpts[i], GradeFmt.);   /* use format to assign values */
   FillColor = colors[i];               /* color for this interval */
   output;
end;
drop i;
run;

title "Bar Chart of Test Grades";
title2 "Discrete Attribute Map";
proc sgplot data=Tests2 dattrmap=GradeAttrs;
   hbar Name / response=score categoryorder=respdesc group=Grade attrid=GradeColors;
   xaxis label="Score" grid max=1 values=(.1 to .9 by .1) valueshint;  /* make sure 100% is max value */
run;

The graph is shown at the top of this article. The bars are colored red, orange, yellow, and green, according to the result of applying the GRADEFMT. format to the test score.

Is this example useful if you are not a teacher? Yes, the ideas apply to any bar chart in which you want to use a variable for the length and then use a formatted value to specify bar colors. In this example, the length of the bar and the color were related, but that does not have to be the case in general.

And this example gave us an opportunity to review several important data visualization tricks in SAS, many of which are not restricted to bar charts:

- How to create and use a user-defined format to bin a continuous variable.
- How to create a DATA step view to avoid creating a new data set.
- How to use the ColorBrewer system to create a color ramp that is perceptually balanced.
- How to create a discrete attribute map that assigns colors to groups.

The post Using colors to visualize groups in a bar chart in SAS appeared first on The DO Loop.

]]>The post On using flexible distributions to fit data appeared first on The DO Loop.

]]>"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
— John von Neumann

Ever since the dawn of statistics, researchers have searched for the Holy Grail of statistical modeling: namely, a flexible distribution that can model any continuous univariate data. As the quote by John von Neumann indicates, with enough parameters you can create a very flexible system of continuous distributions that can fit a wide range of data shapes. Some well-known flexible distributions include:

- The Pearson system of distributions (circa 1895)
- The Johnson system of distributions (1949)
- Keelin's metalog distribution (Keelin, 2016)
- Power transformations of a normal distribution: Fleishman (1978) and Headrick (2002, 2010) proposed models that use a polynomial transformation of a normal distribution. However, this system cannot model distributions that have extreme skewness and kurtosis, so it will not be considered further in this article.

A reader asked about the result of fitting a flexible distribution to data when the data distribution is a known common distribution such as normal, gamma, uniform, and so forth. Does the flexible model result in the same common distribution that generated the data? Or do you get some different distribution?

For example, suppose you simulate a large sample from the uniform distribution and then fit one of these systems to the uniformly distributed data. Is the PDF of the model the uniform distribution, or is it nonuniform?

It is an interesting question. In general, the answer is that you get a different distribution than the one that generated the data. The PDF of the model is NOT the same as the "parent" data-generating distribution. This article discusses why and provides an example for the Pearson system, the Johnson system, and the metalog system.

The Pearson and Johnson systems have a similar conceptual framework: They fit moments of the data distribution. Specifically, they produce a continuous distribution that has the same skewness and kurtosis as the sample skewness and sample kurtosis in the data. (The mean and standard deviation are not used because you can always scale and translate a distribution without affecting its intrinsic shape.) Graphically, if you plot the sample skewness and kurtosis on a moment-ratio diagram, then each (skewness, kurtosis) pair corresponds to one and only one distribution in the Pearson and Johnson systems.
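To make the idea concrete, the following standalone Python sketch (not a SAS program) computes the two sample statistics that a moment-matching system fits. For a large uniform sample, the values are close to the population values s = 0 and k = 1.8:

```python
import random

random.seed(1)
x = [random.random() for _ in range(100_000)]    # large uniform sample

n = len(x)
mean = sum(x) / n
m2 = sum((v - mean)**2 for v in x) / n           # central moments
m3 = sum((v - mean)**3 for v in x) / n
m4 = sum((v - mean)**4 for v in x) / n

skew = m3 / m2**1.5     # sample skewness (approximately 0 for uniform data)
kurt = m4 / m2**2       # sample full kurtosis (approximately 1.8)
```

A moment-matching fit then selects the member of the system whose (skewness, kurtosis) pair equals these two numbers.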

Consequently, the question becomes, "Which common distributions are exactly represented in the Pearson or Johnson systems?"

The Pearson system decomposes the moment-ratio diagram into 12 regions, each corresponding to a distribution, known as Type I, Type II, ..., Type XII. When you fit the model, you first determine (from the sample skewness and kurtosis) which of the 12 regions the sample appears to be in. You then fit the chosen model to the data. Not all regions correspond to familiar distributions, but several do:

- The Type I family is the family of beta distributions.
- The Type II family is the family of symmetric beta distributions, which includes the uniform distribution as a special case.
- The Type III family contains gamma distributions.
- The Type V family contains inverse gamma distributions.
- The Type VI family contains F distributions.
- The Type VII family contains Student's t distributions.
- The normal distribution (no type number) is a limiting case for several families.

This list is summarized by using the following image of the moment-ratio diagram, which is overlaid with the families in the Pearson system. The graph is taken from the documentation of PROC SIMSYSTEM in SAS Viya.

Consequently, if your data were generated by one of the distributions on this list, then when you fit a Pearson model, you have a chance to
recover the true data-generating distribution. For example, if you have uniformly distributed data that has zero skewness and (full) kurtosis equal to 1.8,
then you could fit the data by using a symmetric beta distribution, which would result in the uniform distribution. Conclusion: The Pearson family can fit uniformly distributed data *exactly*.
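You can confirm this by using the well-known closed-form expressions for the skewness and full kurtosis of the Beta(a,b) distribution. The uniform distribution is Beta(1,1), and substituting a = b = 1 yields s = 0 and k = 1.8. A quick Python check:

```python
import math

def beta_skew_kurt(a, b):
    """Skewness and full kurtosis of the Beta(a,b) distribution."""
    s = 2*(b - a)*math.sqrt(a + b + 1) / ((a + b + 2)*math.sqrt(a*b))
    k = 3 + 6*((a - b)**2*(a + b + 1) - a*b*(a + b + 2)) \
          / (a*b*(a + b + 2)*(a + b + 3))
    return s, k

s, k = beta_skew_kurt(1, 1)   # the uniform distribution: s = 0, k = 1.8
```

Because (0, 1.8) lies inside the symmetric-beta (Type II) family, a Pearson fit to uniform data can land exactly on the uniform distribution.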

In contrast, the Johnson system contains only four families: The normal, the lognormal, the SB distribution (which models all bounded distributions), and the SU distribution (for unbounded distributions). Von Neumann's quote at the top of this article is especially applicable to the Johnson SB and SU distributions, which each contain four parameters!

Consequently, if your data are generated by the normal or lognormal distribution, you can recover the distributional form by using the Johnson system. For other familiar distributions, you cannot.

This might lead you to wonder about fitting a simple distribution such as the uniform. If the Johnson SB distribution (fitted to uniform data) does not result in the uniform distribution, then what does the SB model look like? To answer that question, simulate a large data set from the U(0,1) distribution and fit the data to the SB distribution on (0,1) by using PROC UNIVARIATE, which supports the Johnson SB distribution:

/* fit a uniform distribution by using the Johnson SB distribution.
   What is the PDF of the model? It's NOT a constant function! */
data U;
call streaminit(1);
do i = 1 to 10000;
   u = rand("Uniform");
   output;
end;
run;

proc univariate data=U;
   var u;
   histogram u / SB(theta=0, sigma=1, fitmethod=moments)
                 endpoints=(0 to 1 by 0.05);
   ods select Moments ParameterEstimates Histogram;
run;

Notice that the PDF of the fitted SB distribution drops off near u=0 and u=1. This means that if you simulated samples from the fitted model, you will undersample data near the endpoints of the interval.
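You can see why the density must drop off near the endpoints by writing down the Johnson SB density. The following Python sketch uses the standard parameterization (this is illustrative; the fitted γ and δ values come from PROC UNIVARIATE). The squared-logit term in the exponent drives the density to 0 as x approaches either endpoint:

```python
import math

def sb_pdf(x, gamma, delta, theta=0.0, sigma=1.0):
    """Johnson SB density on the interval (theta, theta + sigma)."""
    u = (x - theta) / sigma                      # map to (0, 1)
    z = gamma + delta * math.log(u / (1 - u))    # normalizing transform
    return delta / (sigma * math.sqrt(2*math.pi) * u * (1 - u)) \
           * math.exp(-0.5 * z * z)

# near the endpoints, the density is tiny compared to the center
print(sb_pdf(0.001, 0, 1) < sb_pdf(0.5, 0, 1))   # True
```

Consequently, any SB model of uniform data undersamples near 0 and 1, regardless of the fitted parameter values.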

The metalog system does not use moments or the moment-ratio diagram. Rather, it uses the empirical cumulative distribution function of the data or, more precisely, the inverse of the CDF, which is the empirical quantile function. The metalog system is based on the logistic distribution, which is the only "named" distribution that it will fit exactly.

If the metalog does not fit the uniform distribution exactly, then what does the PDF look like if you fit a metalog model to a large uniformly distributed sample?
That question is a little ambiguous because the metalog system requires that you specify the *order* of the family, which is a parameter that determines how
many component functions are used to fit the empirical quantile function.
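For readers unfamiliar with the metalog, a k-term model writes the quantile function as a linear combination of k basis terms built from the logit function. The following Python sketch follows the basis in Keelin (2016); the function name is hypothetical, and the coefficient vector a is what a fitting routine such as ML_CreateFromData estimates by least squares against the empirical quantiles. With a = (0, 1), the model reduces to the standard logistic quantile function:

```python
import math

def metalog_quantile(p, a):
    """Quantile function of a k-term (unbounded) metalog, k = len(a)."""
    L = math.log(p / (1 - p))      # logit of the cumulative probability
    c = p - 0.5
    g = [1.0, L, c*L, c]           # first four basis terms
    m = 5
    while len(g) < len(a):         # higher terms alternate power and power*logit
        g.append(c**((m - 1)//2) if m % 2 else c**(m//2 - 1) * L)
        m += 1
    return sum(ai*gi for ai, gi in zip(a, g))
```

This is why the logistic is the only "named" distribution that the metalog reproduces exactly: it corresponds to a coefficient vector with a single nonzero logit term.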

The Appendix shows how to download the metalog package for SAS IML. Assuming that the functions in the package are stored, the following SAS IML statements fit a 5-term metalog model to the data and plot the results:

proc iml;
load module=_all_;    /* load the metalog function library */
use U;  read all var "u";  close;    /* read the data */

/* 5-term bounded metalog on [0,1] */
k = 5;
MLObj = ML_CreateFromData(u, k, {0 1});
title "5-Term Metalog Model";
p = do(0, 1, 0.001);          /* cumulative probability values */
call ML_PlotPDF(MLObj, p);    /* the density function for the model */

The graph is not uniform. The model has greater-than expected density in the middle of the interval and less near the endpoints. There is also a "spike" in the PDF of the model near the endpoints.

You can change the number of terms in the metalog distribution and fit the higher-order model. The result is different, but still has nonuniform probability, as shown below:

/* 9-term bounded metalog on [0,1] */
k = 9;
MLObj = ML_CreateFromData(u, k, {0 1});
title "9-Term Metalog Model";
call ML_PlotPDF(MLObj, p);    /* the density function for the model */

When you fit a flexible distribution to data that are generated by a common distribution (such as the uniform distribution), you should not expect to recover the data-generating distribution. The model will only agree with the data-generating distribution when the flexible system contains a family that is the same as the data-generating distribution.

For example, the Pearson system
contains families for the beta, uniform, gamma, inverse gamma, F, *t*, and normal distributions.
Those distributions can be represented exactly by the system.
The Johnson system can fit the normal and lognormal families.
The metalog system can fit the logistic distribution.
All other data-generating distributions will be approximated by using the flexible distributions in the system, which have many parameters.

This article shows that a flexible model for the uniform distribution might be more complicated than you expect. That is why I recommend using classical "named" distributions to model simple data distributions. Reserve the flexible families for data distributions that have complex shapes.

A previous article shows how to download and use the metalog package for SAS IML software. For your convenience, the following statements were used to generate the graphs of the metalog distribution in the current article:

/*************************************/
/* metalog system:
   https://blogs.sas.com/content/iml/2023/03/13/metalog-sas.html
   Download metalog package from GitHub (do this only one time) */
/*************************************/
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-iml-packages;  /* clone repo to WORK, or use permanent libref */

/* clone repository; if repository exists, skip download */
data _null_;
   if fileexist("&repoPath.") then
      put 'Repository already exists; skipping the clone operation';
   else do;
      put "Cloning repository 'sas-iml-packages'";
      rc = gitfn_clone("https://github.com/sassoftware/sas-iml-packages/", "&repoPath." );
   end;
run;

/*************************************/
/* Use %INCLUDE to read source code and STORE functions
   to current storage library */
/*************************************/
proc iml;
%include "&repoPath./Metalog/ML_proc.sas";    /* each file ends with a STORE statement */
%include "&repoPath./Metalog/ML_define.sas";
quit;

/*************************************/
/* now use the functions in the metalog package */
/*************************************/
proc iml;
load module=_all_;    /* load the metalog function library */
use U;  read all var "u";  close;    /* read the data */

/* 5-term bounded metalog on [0,1] */
k = 5;
MLObj = ML_CreateFromData(u, k, {0 1});
title "5-Term Metalog Model";
p = do(0, 1, 0.001);          /* cumulative probability values */
call ML_PlotPDF(MLObj, p);    /* the density function for the model */

/* 9-term bounded metalog on [0,1] */
k = 9;
MLObj = ML_CreateFromData(u, k, {0 1});
title "9-Term Metalog Model";
call ML_PlotPDF(MLObj, p);    /* the density function for the model */

The post On using flexible distributions to fit data appeared first on The DO Loop.

]]>