The post What is a CAS-enabled procedure? appeared first on The DO Loop.

- What are the computing environments in Viya, and how should a programmer think about them?
- What procedures do you get when you order a programming-oriented Viya product such as SAS Visual Statistics or SAS Econometrics? Of these procedures, which are CAS-enabled?
- If you have legacy SAS programs, can you still run them if your company migrates from SAS 9 to SAS Viya?

I am a programmer, so I thought it might be helpful for me to discuss these topics programmer-to-programmer. In a series of articles, I am going to discuss issues that a SAS statistical programmer might face when migrating to Viya from SAS 9. I use the term "SAS 9" to refer to the SAS Workspace Server that runs procedures in traditional products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the most recent version of the "classic" SAS programming environment. It is the version of SAS that existed before SAS Viya was created.

In SAS 9, a procedure runs on the SAS Workspace Server. In SAS 9, the word "client" refers to a program such as Enterprise Guide (EG) or SAS Studio, which runs on a PC and submits code to the SAS Workspace Server. The server computes the results and sends tables and graphs back to the client, which displays them. Typically, the input and output data sets remain on the server.

You can think of SAS Viya as having two main components: the CAS server where the data are stored and the computations are performed, and support for several *client languages*. A client language enables you to connect to the CAS server and tell it what analyses you want to perform. So, in the world of Viya, "client" no longer refers to a GUI like EG, but to an entire programming environment such as SAS, Python, or R. The purpose of the client software is to
connect to CAS, submit actions, and get back results. You then use the capabilities of the client language to display the results as a table or graph. For example, the SAS client uses ODS to display tables and graphs. In Python, you might use matplotlib to graph the results. In R, you might use ggplot. In all cases, you can also use the native capabilities of the client language (DATA step, Pandas, the tidyverse, and so forth) to modify, aggregate, or enhance the output.

I use the SAS client to connect to and communicate with the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills. Others have written about how to use other clients (such as a Python client) to connect to CAS and to call CAS actions.

When you purchase a product in SAS Viya, you get three kinds of computational capabilities:

- Actions, which run on the CAS server. You can call an action from any client language.
- CAS-enabled procedures, which are parsed on the SAS client but call CAS actions "under the covers."
- Legacy SAS procedures that run on the SAS client, just as they do in SAS 9.

Obviously, the CAS-enabled and legacy procedures are only available on the SAS client.

To give an example, SAS Visual Statistics contains action sets (which contain actions),
CAS-enabled procedures, and all the procedures in SAS/STAT.
All procedures run on the SAS compute server, which is also called the *SAS client*.
(The SAS compute server was formerly known as the SAS programming runtime environment, or SPRE.)
However, the CAS-enabled procedures call one or more actions that run on the CAS server, then display the results as ODS tables and graphics.

A CAS-enabled procedure performs very few computations on the client. In contrast, a legacy procedure that is not CAS-enabled performs all of its computations on the SAS client. It does not call any CAS actions. An example of a CAS-enabled procedure is the REGSELECT procedure, which performs linear regression with feature selection. It contains many of the features of the GLMSELECT procedure, which is a traditional regression procedure in SAS/STAT.

The following links are helpful for discovering the names and functionality of CAS-enabled procedures:

- If you work with SAS/STAT procedures in SAS 9, look at the CAS-enabled statistical procedures in SAS Visual Statistics.
- If you work with SAS/ETS procedures in SAS 9, look at the CAS-enabled econometrics procedures in SAS Econometrics or SAS Visual Forecasting.
- Some older Base SAS procedures are CAS-enabled when you run them in SAS Viya. Others, like PROC CASUTIL and PROC MDSUMMARY, are newer and were designed to be CAS-enabled from the beginning.

Naturally, SAS 9 statistical programmers want to make sure that their existing programs will run in Viya. That is why SAS Visual Statistics comes with the legacy SAS/STAT procedures. The same is true for SAS/ETS procedures, which are shipped as part of SAS Econometrics. And the SAS IML product in Viya contains PROC IML, which runs on the SAS client, as well as the newer iml action, which runs in CAS.

So what happens if, for example, you call PROC REG in SAS and ask it to perform a regression on a SAS data set in the WORK libref? PROC REG will do what it has always done. It will run in the SAS environment. It will not run on the CAS server. It will not magically run faster than it used to in SAS 9. The performance of most legacy programs should be comparable to their performance in SAS 9.

There are some exceptions to that rule. Some SAS procedures have been enhanced and now perform better than their SAS 9 counterparts. For example, the SAS IML team has enhanced certain functions in PROC IML so that they perform better in SAS Viya than in the SAS 9 version of the procedure. The SAS IML development team is focused exclusively on improving performance and adding features in SAS Viya, in both PROC IML and the iml action.

Another exception is that some Base SAS procedures were enhanced so that they behave differently depending on the location of the data. Many Base SAS procedures are now hybrid procedures. If you tell them to analyze a CAS table, they will call an action, which runs in CAS, and retrieve the results. If you tell them to analyze a SAS data set, they will run on the SAS client, just like they do in SAS 9. For example, PROC MEANS will call the aggregation.aggregate action to compute descriptive statistics on variables in a CAS table.

To make the situation more complicated, some of the legacy Base SAS procedures support features that are not supported in CAS. When you request an option that is not supported in CAS, the procedure will download the data from CAS into SAS and perform the computation on the client. This can be inefficient, so check the documentation before you start using legacy procedures to analyze CAS tables. As a rule, I prefer to use legacy procedures to analyze SAS data sets on the client; I use newer CAS-enabled procedures for analyzing CAS tables.

At a recent seminar for SAS 9 programmers, there were lots of questions about SAS Viya and what it means for a SAS programmer to start programming in Viya. This article is the first of several articles that I intend to write for SAS programmers. I don't know everything, but I hope that other SAS programmers will join me in sharing what they have learned about the process of migrating from SAS 9 to SAS Viya.

If you are a SAS programmer who has general questions about SAS Viya, let me know what you are thinking by leaving a comment. I might not know the answer, but I'll try to find someone who does. For programming questions ("How do I do XYZ in Viya?"), post your question and sample data to the SAS Support Communities. There is a dedicated community for SAS Viya questions, so take advantage of the experts there.


The post The order of vertices on a convex polygon appeared first on The DO Loop.

But what if the CVEXHULL function did not output the vertices in sequential order? It turns out that you can perform a simple computation that orders the vertices of a convex polygon. This article shows how.

Let's start by defining six points that form the vertices of a convex polygon. The vertices are not sorted in any way, so if you connect the points in the order given, you get a star-like pattern:

proc iml;
P = { 0  2,
      6  0,
      0  0,
      4 -1,
      5  2,
      2 -1 };
/* create a helper function to connect vertices in the order they are given */
start GraphVertices(P);
   Poly = P // P[1,];    /* repeat first point to close the polygon */
   call series(Poly[,1], Poly[,2]) procopt="aspect=1" option="markers" grid={x y};
finish;

title "Connect Unsorted Points";
run GraphVertices(P);

The line plot does not look like a convex polygon because the vertices were not ordered. However, it is not hard to order the vertices of a convex polygon.

The key idea is to recall that the *centroid* of a polygon is the arithmetic average of the coordinates of its vertices. For a convex polygon, the centroid always lies in the interior of the polygon, as shown in the graph to the right. The centroid is displayed as a large dot in the center of the polygon.

You can use the centroid as the origin and construct the vectors from the centroid to each vertex. For each vector, you can compute the angle made with the horizontal axis. You can then sort the angles, which provides a sequential ordering of the vertices of the convex polygon. In the graph at the right, the angles are given in degrees, but you can use radians for all the calculations. For this example, the angles that the vectors make with the positive X axis are 38 degrees, 150 degrees, 187 degrees, and so forth.

The SAS/IML language provides a compact way to compute the centroid and the vectors. You can use the ATAN2 function in SAS to compute the angles, as follows:

C = mean(P);                     /* centroid of a convex polygon */
v = P - C;                       /* vectors from centroid to points on convex hull */
radian = atan2( v[,2], v[,1] );  /* angle with X axis for each vector */

At this point, you could sort the vertices by their radian measure. However, the ATAN2 function returns values in the interval (-π, π]. If you prefer to order the vertices by using the standard "polar angle," which is in the interval [0, 2π), you can add 2π to any negative angle from the ATAN2 function. You can then use the angles to sort the vertices, as follows:

/* Optional: The ATAN2 function returns values in (-pi, pi].
   Convert to values in [0, 2*pi) */
pi = constant('pi');
idx = loc( radian<0 );
radian[idx] = radian[idx] + 2*pi;

/* now sort the points by angle */
call sortndx(idx, radian);   /* get row numbers that sort the angles */
Q = P[idx,];                 /* sort the vertices of the polygon by their angle */

title "Connect Sorted Points";
run GraphVertices(Q);

With this ordering, the vertices are now sorted in sequential order according to the angle each vertex makes with the centroid.
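The same centroid-and-angle sort can be sketched in a few lines of Python for readers who want to experiment outside SAS. This is a hypothetical helper (the function name and pure-Python approach are mine, not from the post), and it assumes the points are distinct vertices of a convex polygon:

```python
import math

def sort_convex_vertices(P):
    """Sort vertices of a convex polygon counterclockwise by the
    polar angle each vertex makes with the centroid."""
    cx = sum(x for x, _ in P) / len(P)   # centroid = mean of the coordinates
    cy = sum(y for _, y in P) / len(P)
    def polar_angle(pt):
        a = math.atan2(pt[1] - cy, pt[0] - cx)   # atan2 returns values in (-pi, pi]
        return a if a >= 0 else a + 2 * math.pi  # map to [0, 2*pi)
    return sorted(P, key=polar_angle)

# the six unsorted vertices from the SAS/IML example
P = [(0, 2), (6, 0), (0, 0), (4, -1), (5, 2), (2, -1)]
print(sort_convex_vertices(P))
```

Connecting the sorted points (and repeating the first one) traces the convex polygon instead of the star pattern.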

In summary, the CVEXHULL function in SAS/IML returns vertices of a convex polygon in sequential order. But even if you are given the vertices in a random order, you can perform a computation to sort them by using the angle each vertex makes with the centroid.


The post Two-dimensional convex hulls in SAS appeared first on The DO Loop.

Given a cloud of points in the plane, it can be useful to identify the *convex hull* of the points.
The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as its vertices. An example of a convex hull is shown
to the right. The convex hull is the polygon that encloses the points.
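For readers who want to experiment with the idea outside SAS, a standard algorithm such as Andrew's monotone chain computes a planar convex hull in O(n log n) time. The Python sketch below is my own illustration, not the CVEXHULL implementation; it returns only the hull vertices (strictly convex corners), in counterclockwise order:

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull for 2-D points.
    Returns the hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):   # z-component of (a-o) x (b-o); >0 means left turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:                       # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # omit the duplicated endpoints

# the 18 points from the SAS/IML documentation example
pts = [(0,2),(0.5,2),(1,2),(0.5,1),(0,0),(0.5,0),(1,0),(2,-1),(2,0),
       (2,1),(3,0),(4,1),(4,0),(4,-1),(5,2),(5,1),(5,0),(6,0)]
print(convex_hull(pts))
```

Because the cross-product test uses `<= 0`, collinear boundary points such as (0.5, 2) and (1, 2) are dropped, which matches the CVEXHULL convention that such points are not vertices.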

This article describes the CVEXHULL function in the SAS/IML language, which computes the convex hull for a set of planar points.

The CVEXHULL function takes an *N* x 2 data matrix. Each row of the matrix is the (x,y) coordinates for a point. The CVEXHULL function returns a column vector of length *N*.

- The first few elements are positive integers. They represent the rows of the data matrix that form the convex hull.
- The positive integers are sorted so that you can visualize the convex hull by connecting the points in order.
- The remaining elements are negative integers. The absolute values of these integers represent the rows of the data matrix that are contained in the convex hull or are on the boundary of the convex hull but are not vertices.

The following example comes from the SAS/IML documentation. The data matrix contains 18 points. Of those, six are the vertices of the convex hull. The output of the CVEXHULL function is shown below:

proc iml;
points = {0   2,  0.5 2,  1   2,  0.5 1,  0   0,  0.5 0,
          1   0,  2  -1,  2   0,  2   1,  3   0,  4   1,
          4   0,  4  -1,  5   2,  5   1,  5   0,  6   0 };

/* Find the convex hull:
   - indices on the convex hull are positive
   - the indices for the convex hull are listed first, in sequential order
   - interior indices are negative */
Indices = cvexhull( points );
reset wide;
print (Indices`)[L="Indices"];

I have highlighted the first six elements. The indices tell you that the convex hull is formed by using the 1st, 5th, 8th, 14th, 18th, and 15th points of the data matrix. You can use the LOC function to find the positive values in the `Indices` vector. You can use those values to extract the points on the convex hull, as follows:

hullIdx = indices[loc(indices>0)];   /* the positive indices */
convexHull = points[hullIdx, ];      /* extract rows */
print hullIdx convexHull[c={'cx' 'cy'} L=""];

The output shows that the convex hull is formed by the six points (0,2), (0,0), ..., (5,2).

The graph at the beginning of this article shows the convex hull as a shaded polygon. The original points are overlaid on the polygon and labeled by the observation number. The six points that form the convex hull are colored red. This section shows how to create the graph.

The graph uses the POLYGON statement to visualize the convex hull.
This enables you to shade the interior of the convex hull. If you do not need the shading, you
could use a SERIES statement, but to get a *closed* polygon you would need to add the first point to the end of the list of vertices.

To create the graph, you must write the relevant information to a SAS data set so that you can use PROC SGPLOT to create the graph. The following statements write the (x,y) coordinates of the point, the observation numbers (for the data labels), the coordinates of the convex hull vertices (cx, cy), and an ID variable, which is required to use the POLYGON statement. It also creates a binary indicator variable that is used to color-code the markers in the scatter plot:

x = points[,1];  y = points[,2];
obsNum = t(1:nrow(points));   /* optional: use observation numbers for labels */

/* The points on the convex hull are sorted in counterclockwise order.
   If you use a series plot, you must repeat the first point so that the
   polygon is closed. For example, use
   convexHull = convexHull // convexHull[1,];
*/
cx = convexHull[,1];  cy = convexHull[,2];
ID = j(nrow(cx),1,1);         /* create ID variable for POLYGON statement */

/* create a binary (0/1) indicator variable */
OnHull = j(nrow(x), 1, 0);    /* most points NOT vertices of the convex hull */
OnHull[hullIdx] = 1;          /* these points are the vertices */

create CHull var {'x' 'y' 'cx' 'cy' 'ID' 'obsNum' 'OnHull'};
append;
close;
QUIT;

In the graph at the top of this article, vertices of the convex hull are colored red and the other points are blue. When you use the GROUP= option in PROC SGPLOT statements, the group colors might depend on the order of the observations in the data. To ensure that the colors are consistent regardless of the order of the data set, you can use a discrete attribute map to associate colors and values of the grouping variable. For details about using a discrete attribute map, see Kuhfeld's 2016 article.

To use a discrete attribute map, you need to define it in a SAS data set, read it by using the DATTRMAP= option on the PROC SGPLOT statement, and specify it by using the ATTRID= statement on the SCATTER statement, as follows:

data DAttrs;                      /* use DATTRMAP=<data set name> */
length MarkerStyleElement $11.;
ID = "HullAttr";                  /* use ATTRID=<ID value> */
Value = 0; MarkerStyleElement = "GraphData1"; output;   /* 0 ==> 1st color */
Value = 1; MarkerStyleElement = "GraphData2"; output;   /* 1 ==> 2nd color */
run;

title "Points and Convex Hull";
proc sgplot data=CHull DATTRMAP=DAttrs;
   polygon x=cx y=cy ID=ID / fill outline lineattrs=GraphData2;
   scatter x=x y=y / datalabel=obsNum group=OnHull
                     markerattrs=(symbol=CircleFilled) ATTRID=HullAttr;
run;

The graph is shown at the top of this article. Notice that the points (0.5, 2) and (1, 2) are on the boundary of the convex hull, but they are drawn in blue because they are not vertices of the polygon.

In summary, you can compute a 2-D convex hull by using the CVEXHULL function in SAS/IML software. The output is a set of indices, which you can use to extract the vertices of the convex hull and to color-code markers in a scatter plot.

By the way, there is a hidden message in the graph of the convex hull. Can you see it? It has been hiding in the SAS/IML documentation for more than 20 years.

In closing, I'll mention that a 2-D convex hull is one computation in the general field of computational geometry. The SAS/IML group is working to add additional functionality to the language, including convex hulls in higher dimensions. In your work, do you have specific needs for results in computational geometry? If so, let me know the details in the comments.


The post Create a frequency polygon in SAS appeared first on The DO Loop.

I was recently asked how to create a frequency polygon in SAS. A frequency polygon is an alternative to a histogram that shows similar information about the distribution of univariate data. It is the piecewise linear curve formed by connecting the midpoints of the tops of the bins. The graph to the right shows a histogram and a frequency polygon for the same data. This article shows how to create a frequency polygon in SAS.

In practice, frequency polygons are not used as often as histograms are, but they are useful pedagogical tools for teaching the fundamentals of density estimation. The histogram is an estimate of the density of univariate data, but it is a bar chart. Accordingly, it looks different from density estimate curves, such as parametric densities and kernel density estimates. The frequency polygon shows the same information as a histogram but displays the information as a line plot. Therefore, you can more easily compare the frequency polygon curve and other density estimate curves.

A frequency polygon is also a good way to introduce the ideas behind a cumulative distribution. An ogive is a graph of the cumulative sum of the vertical coordinates of the frequency polygon. The ogive approximates the cumulative distribution in the same way that the frequency polygon approximates the density.
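The relationship between the frequency polygon and the ogive can be illustrated in a few lines of language-neutral code. The Python sketch below is a hypothetical illustration (the function names and the tiny data set are mine, and it assumes equal-width bins), not the PROC UNIVARIATE computation:

```python
def freq_polygon(data, start, width, nbins):
    """Return (midpoint, count) pairs for equal-width bins:
    the vertices of the frequency polygon."""
    counts = [0] * nbins
    for x in data:
        k = int((x - start) // width)   # bin index for this observation
        if 0 <= k < nbins:
            counts[k] += 1
    mids = [start + width * (k + 0.5) for k in range(nbins)]
    return list(zip(mids, counts))

def ogive(poly):
    """Cumulative sums of the frequency-polygon counts."""
    total, out = 0, []
    for mid, c in poly:
        total += c
        out.append((mid, total))
    return out

data = [1.3, 1.8, 2.0, 2.2, 2.4, 3.1, 3.5, 4.6]
poly = freq_polygon(data, start=1.0, width=1.0, nbins=4)
print(poly)          # bin midpoints paired with bin counts
print(ogive(poly))   # cumulative counts at the same midpoints
```

Connecting the (midpoint, count) pairs gives the frequency polygon; connecting the cumulative pairs gives the ogive.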

You can use the UNIVARIATE procedure in SAS to generate the points for a frequency polygon. You can use the OUTHIST= option to specify a data set that contains the counts for each bar in the histogram. The midpoints of the histogram bins are contained in the _MIDPT_ variable. The count in each bin is contained in the _COUNT_ variable.

If you do not like the default width of the histogram bins, you can use the MIDPOINTS= option to specify your own set of midpoints. For example, the following statements create a histogram for the EngineSize variable in the Sashelp.Cars data set. You can use the SERIES statement in PROC SGPLOT to create a line plot that displays the vertical height of each histogram bar, as follows:

proc univariate data=sashelp.cars(keep=EngineSize);
   var EngineSize;
   histogram / outhist=OutHist grid vscale=count
               midpoints=(1.4 to 8.4 by 0.4);  /* use midpoints= option to specify midpoints */
run;

/* optionally, print the OutHist data */
/* proc print data=OutHist; run; */

title "Frequency Polygon";
proc sgplot data=OutHist;
   series x=_MIDPT_ y=_COUNT_ / markers;
   yaxis grid values=(0 to 80 by 20) label="Count" offsetmin=0;
   xaxis grid values=(1.4 to 8.4 by 0.4) label="Engine Size (L)";
run;

The frequency polygon is shown. Like a histogram, the shape of the frequency polygon depends on the bin width and anchor position. You can change those values by using the MIDPOINTS= option.

As I mentioned earlier, an advantage of the frequency polygon is that it is a curve, not a bar chart. As such, it is easier to compare to other density estimate curves. In PROC UNIVARIATE, you can use the KERNEL option to overlay a kernel density curve on a histogram. You can use the OUTKERNEL= option to write the kernel density estimate to a data set. You can then overlay and compare the frequency curve (a crude histogram-based estimate) and the kernel density estimate, as follows:

proc univariate data=sashelp.cars(keep=EngineSize);
   var EngineSize;
   histogram / outhist=OutHist grid vscale=count
               kernel outkernel=OutKer
               midpoints=(1.4 to 8.4 by 0.4);  /* use midpoints= option to specify midpoints */
   ods select Moments Histogram;
run;

data Density;   /* combine the estimates */
set OutHist OutKer(rename=(_Count_=KerCount));
run;

title "Frequency Polygon and Kernel Density Estimate";
proc sgplot data=Density;
   series x=_MIDPT_ y=_COUNT_ / legendlabel="Frequency Polygon";
   series x=_VALUE_ y=KerCount / legendlabel="Kernel Density Estimate";
   yaxis offsetmin=0 grid values=(0 to 80 by 20) label="Estimated Count";
   xaxis label="Engine Size (L)";
run;

As shown in the graph, a kernel density estimate is a smoother version of the frequency polygon.

This article shows how to create a graph of the frequency polygon in SAS. A frequency polygon is a piecewise linear curve formed by connecting the midpoints of the tops of the bars in a histogram. The frequency polygon is a curve, so it is easier to compare it with other parametric or nonparametric density estimates.

One final remark: I don't like the name "frequency polygon." A polygon is a *closed* planar region formed by connecting a set of points and then connecting the first and last points. The density estimate in this article is not closed. I would prefer a term such as "frequency polyline" or "frequency curve," but "polygon" seems to be the standard term that appears in introductory statistics textbooks.


The post The normal approximation and random samples of the binomial distribution appeared first on The DO Loop.

Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the normal density (the red curve, N(μ=9500, σ=21.79)) can be a very good approximation to the binomial density (blue bars, Binom(p=0.95, nTrials=10000)). However, because the binomial distribution is discrete, its density is defined only for nonnegative integers, whereas the normal density is defined for all real numbers.
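To see how close the approximation is near the center, you can compare the binomial PMF and the normal PDF directly. The pure-Python sketch below is my own illustration (function names are mine); it evaluates the binomial PMF in log space with the log-gamma function so that n = 10,000 does not overflow:

```python
import math

def binom_pmf(k, n, p):
    """Binomial PMF, evaluated in log space to avoid overflow for large n."""
    logpmf = (math.lgamma(n+1) - math.lgamma(k+1) - math.lgamma(n-k+1)
              + k*math.log(p) + (n-k)*math.log(1-p))
    return math.exp(logpmf)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x-mu)**2 / (2*sigma**2)) / (sigma*math.sqrt(2*math.pi))

n, p = 10000, 0.95
mu, sigma = n*p, math.sqrt(n*p*(1-p))   # 9500 and about 21.79
for k in (9450, 9500, 9550):
    print(k, binom_pmf(k, n, p), normal_pdf(k, mu, sigma))
```

Near the mean, the two values agree to several decimal places, which is why the red curve hugs the blue bars in the graph.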

In this graph, the binomial density and the normal density are close. But what does the approximation look like if you overlay a bar chart of a random sample from the binomial distribution? It turns out that the bar chart can have large deviations from the normal curve, even for a large sample. Read on for an example.

I have written two previous articles that discuss the normal approximation to the binomial distribution:

- Overlay the binomial and normal densities: This article shows how to overlay the discrete binomial density and the continuous normal density by using the VBARBASIC (or NEEDLE) statement and the SERIES statement in PROC SGPLOT. The graph to the right uses a SAS program that is presented in that article.
- The Normal approximation to the binomial distribution: How the quantiles compare. This article discusses the fact that the tails of the binomial distribution do not agree with the tails of the normal quantiles, even if the normal approximation is very close in the center of the distribution.

The normal approximation is used to estimate probabilities because it is often easier to use the area under the normal curve than to sum many discrete values. However, as shown in the second article, the discrete binomial distribution can have statistical properties that are different from the normal distribution.

It is important to remember that a *random sample* (from *any* distribution!) can look much different from the underlying probability density function.
The following graph shows a random sample from the binomial distribution Binom(0.95, 10000). The distribution of the sample looks quite different from the density curve.

In statistical terms, the observed and expected values have large deviations. Also, note that there can be a considerable deviation between adjacent bars. For example, in the graph, some bars have about 2% of the total frequency whereas an adjacent bar might have half that value. I was surprised to observe the large deviations in a large sample.

This graph emphasizes the fact that a random sample from the binomial distribution can look different from the smooth bell-shaped curve of the probability density.
I think I was surprised by the magnitude of the deviations from the expected values because I have more experience visualizing *continuous* distributions. For a continuous distribution, we use a histogram to display the empirical distribution. When the bin width of the histogram is greater than 1, it smooths out the differences between adjacent bars to create a much smoother estimate of the density. For example, the following graph displays the distribution of the same random sample but uses a bin width of 10 to aggregate the frequency. The resulting histogram is much smoother and resembles the normal approximation.

The normal approximation is a convenient way to compute probabilities for the binomial distribution. However, it is important to remember that the binomial distribution is a discrete distribution. A binomial random variable can assume only nonnegative integer values. One consequence of this observation is that a bar chart of a random binomial sample can show considerable deviations from the theoretical density. This is normal (pun intended!). It is often overlooked because if you treat the random variable as if it were continuous and use a histogram to estimate the density, the histogram smooths out a lot of the bar-to-bar deviations.
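You can reproduce the qualitative phenomenon in a few lines of pure Python before looking at the SAS program. This is a hedged sketch of my own (the sample size is reduced to 200 draws to keep it fast, and each draw sums Bernoulli trials rather than using a library routine):

```python
import random
from collections import Counter

random.seed(1234)
n_trials, p, n_samples = 10000, 0.95, 200

def binom_draw(n, p):
    """One binomial draw by summing Bernoulli trials (slow but dependency-free)."""
    return sum(random.random() < p for _ in range(n))

sample = [binom_draw(n_trials, p) for _ in range(n_samples)]
counts = Counter(sample)

# adjacent values can have very different observed frequencies,
# even though the expected density varies smoothly from bar to bar
for k in range(9495, 9506):
    print(k, counts.get(k, 0))
```

Even though the expected counts for adjacent values are nearly equal, the observed counts jump around, which is exactly the bar-to-bar deviation described above.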

The following SAS program creates the graphs in this article.

/* define parameters for the Binom(p=0.95, nTrials=10000) simulation */
%let p = 0.95;          /* probability of success */
%let NTrials = 10000;   /* number of trials */
%let N = 1000;          /* sample size */

/* First graph: Compute the density of the Normal and Binomial distributions.
   See https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html */
data Binom;
n = &nTrials;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu-3.5*sigma;             /* evaluate normal density on [Lower, Upper] */
Upper = mu+3.5*sigma;
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("normal", t, mu, sigma);
   output;
end;
/* PMF of binomial distribution */
t = .;  Normal = .;               /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, p, n);
   output;
end;
/* store mu and sigma in macro variables */
call symput("mu", strip(mu));
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability" Normal="Normal Density";
keep t Normal j Binomial;
run;

/* overlay binomial density (needle plot) and normal density (series plot) */
title "Binomial Probability and Normal Approximation";
title2 "Binom(0.95, 10000) and N(9500, 21.79)";
proc sgplot data=Binom;
   needle x=j y=Binomial;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "p = &p" "q = %sysevalf(1-&p)" "nTrials = &nTrials"
         "(*ESC*){unicode mu} = np = &mu"              /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma"
         / position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer;
run;

/*************************/
/* Second graph: simulate a random sample from Binom(p, NTrials) */
data Bin(keep=x);
call streaminit(1234);
do i = 1 to &N;
   x = rand("Binomial", &p, &NTrials);
   output;
end;
run;

/* count the frequency of each observed count */
proc freq data=Bin noprint;
   tables x / out=FreqOut;
run;

data All;
set FreqOut Binom(keep=t Normal);
Normal = 100*1*Normal;   /* match scales: 100*h*PDF, where h=binwidth */
run;

/* overlay sample and normal approximation */
title "Random Binomial Sample";
title2 "Bar Chart of Binom(0.95, 10000)";
proc sgplot data=All;
   needle x=x y=Percent;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n" "p = &p"
         "(*ESC*){unicode mu} = &mu"                   /* use Greek letters */
         "(*ESC*){unicode sigma} = &sigma"
         / position=topright border;
   yaxis label="Percent / Scaled Density";
   xaxis label="x" integer;
run;

/*************************/
/* Third graph: the bar-to-bar deviations are smoothed if you use a histogram */
title2 "Histogram BinWidth=10";
proc sgplot data=Bin;
   histogram x / scale=percent binwidth=10;
   xaxis label="x" integer values=(9400 to 9600 by 20);
run;

/* for comparison, the histogram looks like the bar chart if you set BINWIDTH=1 */
title2 "Histogram BinWidth=1";
proc sgplot data=Bin;
   histogram x / binwidth=1 scale=percent;
   xaxis label="x" integer;
run;


The post Add reference lines to a bar chart in SAS appeared first on The DO Loop.

This article shows two ways to overlay a reference line on the categorical axis of a bar chart. But the SAS programmer wanted more. He wanted to create a bar for each day of the year. That is a lot of bars! For bar charts that have many bars, I recommend using the NEEDLE statement to create a needle plot. The second part of this article demonstrates a needle plot and overlays reference lines for certain holidays.

For simplicity, this article discusses only vertical bar charts, but all programs can be adapted to display horizontal bar charts.

First, to be clear, you can easily add *horizontal* reference lines to a vertical bar chart. This is straightforward. The programmer wanted to add *vertical* reference lines to the *categorical* axis, as shown in the graph to the right. In this graph, reference lines are added behind the bars for Age=12 and Age=14. I made the bars semi-transparent so that the full reference lines are visible.

As the SAS programmer discovered, the following attempt to add reference lines does not display any reference lines:

title "Bar Chart with Reference Line on Categorical Axis";
proc sgplot data=Sashelp.Class;
   refline 12 14 / axis=x lineattrs=(color=red);   /* DOES NOT WORK */
   vbar Age / response=Weight transparency=0.2;
run;

Why don't the reference lines appear? As I have previously written, you must specify the *formatted* values for a categorical axis. This is mentioned in the documentation for the REFLINE statement, which states that "unformatted numeric values do not map to a formatted discrete axis. For example, if reference lines are drawn at points on a discrete X axis, the REFLINE values must be the formatted value that appears on the X axis."
In other words, you must change the REFLINE values to be "the formatted values," which are '12' and '14'. The following call to PROC SGPLOT displays the vertical reference lines:

proc sgplot data=Sashelp.Class;
   refline '12' '14' / axis=x lineattrs=(color=red);   /* YES! THIS WORKS! */
   vbar Age / response=Weight transparency=0.2;
run;

The reference lines are shown in the graph at the beginning of this section.

I prefer to use the VBARBASIC statement for most bar charts. If you use the VBARBASIC statement, you can specify the raw reference values. To be honest, I am not sure why it works, but, in general, the VBARBASIC statement is better when you need to overlay a bar chart and other graphical elements. If you use the VBARBASIC statement, the natural syntax works as expected:

proc sgplot data=Sashelp.Class;
   refline 12 14 / axis=x lineattrs=(color=red);   /* THIS WORKS, TOO! */
   vbarbasic Age / response=Weight transparency=0.2;
run;

The graph is the same as shown in the previous section.

This section discusses an example that has hundreds of bars. Suppose you want to display a bar chart for sales by date for an entire year. For data like these, I have two recommendations:

- Do not use a vertical bar chart. Even if each bar requires only three pixels, the chart will be more than 3*365 ≈ 1,100 pixels wide. On a monitor that displays 72 pixels per inch, this graph would be about 40 cm (15.3 inches) wide. A better choice is to use a needle plot, which is essentially a bar chart where each bar is represented as a vertical line.
- The horizontal axis cannot be discrete. If it is, you will get 365 dates printed along the axis. Instead, you want to use the XAXIS TYPE=TIME option to display the bars along an axis where tick marks are placed according to months, not days. (If the categories are not dates but are "days since the beginning," you can use the XAXIS TYPE=LINEAR option instead.)

Recall that the SAS programmer wanted to display holidays on the graph of sales for each day. Rather than specify the holidays on the REFLINE statement (for example, '01JAN2003'd '25DEC2003'd), it is more convenient to put the reference line values into a variable in a SAS data set and specify the variable name on the REFLINE statement. You can use the HOLIDAY function in SAS to get the date associated with major government holidays.
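Outside of SAS, you would compute these dates from their calendar rules. The following Python sketch (the helper names `nth_weekday` and `last_weekday` are my own, not a standard API) computes two of the 2003 holidays, analogous to what the HOLIDAY function returns:

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Return the n-th occurrence (1-based) of a weekday (Mon=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def last_weekday(year, month, weekday):
    """Return the last occurrence of a weekday in a month."""
    # start from the last day of the month and step back to the weekday
    nxt = date(year + (month == 12), month % 12 + 1, 1)
    last = nxt - timedelta(days=1)
    return last - timedelta(days=(last.weekday() - weekday) % 7)

thanksgiving_2003 = nth_weekday(2003, 11, 3, 4)   # 4th Thursday of November
memorial_2003 = last_weekday(2003, 5, 0)          # last Monday of May
```

Thanksgiving 2003 falls on 27NOV2003 and Memorial Day 2003 on 26MAY2003, the same dates that HOLIDAY("Thanksgiving", 2003) and HOLIDAY("Memorial", 2003) return.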

The following SAS DATA step extracts a year's worth of data for the sale of potato chips (in 2003) from the Sashelp.Snacks data set. These data are concatenated with a separate data set that contains the holidays that you want to display by using reference lines. A needle plot shows the daily sales and the reference lines.

data Snacks;      /* sales of potato chips for each date in 2003 */
   set Sashelp.Snacks;
   where '01JAN2003'd <= Date <= '31DEC2003'd AND Product="Classic potato chips";
run;

data Reflines;    /* holidays to overlay as reference lines */
   format RefDate DATE9.;
   RefDate = holiday("Christmas", 2003);       output;
   RefDate = holiday("Halloween", 2003);       output;
   RefDate = holiday("Memorial", 2003);        output;
   RefDate = holiday("NewYear", 2003);         output;
   RefDate = holiday("Thanksgiving", 2003);    output;
   RefDate = holiday("USIndependence", 2003);  output;
   RefDate = holiday("Valentines", 2003);      output;
run;

data All;         /* concatenate the data and reference lines */
   set Snacks RefLines;
run;

title "Sales and US Holidays";
title2 "Needle Plot";
proc sgplot data=All;
   refline RefDate / axis=x lineattrs=(color=red);
   needle x=Date y=QtySold;
run;

Notice that you do not have to use the XAXIS TYPE=TIME option with the NEEDLE statement. The SGPLOT procedure uses the TYPE=TIME option by default when the X variable has a time, date, or datetime format. If you decide to use the VBARBASIC statement, you should include the XAXIS TYPE=TIME statement.

In summary, this article shows how to add vertical reference lines to a vertical bar chart. You can use the VBAR statement and specify the formatted reference values, but I prefer to use the VBARBASIC statement whenever I want to overlay a bar chart and other graphical elements. You can also use a needle plot, which is especially helpful when you need to display 100 or more bars.

The post Add reference lines to a bar chart in SAS appeared first on The DO Loop.


Although PROC UNIVARIATE can fit many univariate distributions, it cannot fit a mixture of distributions. For that task, you need to use PROC FMM, which fits finite mixture models. This article discusses how to use PROC FMM to fit a mixture of two Weibull distributions and how to interpret the results. The same technique can be used to fit other mixtures of distributions. If you are going to use the parameter estimates in SAS functions such as the PDF, CDF, and RAND functions, you cannot use the regression parameters directly. You must transform them into the distribution parameters.

You can use the RAND function in the SAS DATA step to simulate a mixture distribution that has two components, each drawn from a Weibull distribution.
The RAND function samples from a two-parameter Weibull distribution Weib(α, β) whose density is given by

\(f(x; \alpha, \beta) =
\frac{\alpha}{\beta^{\alpha}} x^{\alpha -1} \exp \left(-\left(\frac{x}{\beta}\right)^{\alpha }\right)\)

where
α is a shape parameter and β is a scale parameter. This parameterization is used by most Base SAS functions and procedures, as well as many regression procedures in SAS.
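As a sanity check on this parameterization (α as shape, β as scale), the following Python sketch verifies numerically that the density is the derivative of the Weibull CDF, F(x) = 1 − exp(−(x/β)^α), and that it integrates to 1:

```python
import math

def weib_pdf(x, alpha, beta):
    """Weibull density with shape alpha and scale beta (the RAND('Weibull') convention)."""
    return (alpha / beta) * (x / beta) ** (alpha - 1) * math.exp(-((x / beta) ** alpha))

def weib_cdf(x, alpha, beta):
    return 1 - math.exp(-((x / beta) ** alpha))

alpha, beta = 1.5, 0.8
# the PDF should equal the derivative of the CDF ...
x, h = 0.9, 1e-6
deriv = (weib_cdf(x + h, alpha, beta) - weib_cdf(x - h, alpha, beta)) / (2 * h)
# ... and should integrate to 1 on (0, infinity); integrate on (0, 20] by a Riemann sum
dx = 0.001
total = sum(weib_pdf(i * dx, alpha, beta) * dx for i in range(1, 20000))
```

The numerical derivative matches the closed-form density, and the Riemann sum is 1 to within the discretization error.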
The following SAS DATA step simulates data from two Weibull distributions.
The first component is sampled from
Weib(α=1.5, β=0.8)
and the second component is sampled from
Weib(α=4, β=2). For the mixture distribution, the probability of drawing from the first distribution is 0.667 and the probability of drawing from the second distribution is 0.333.
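If you want to reproduce a similar simulation outside of SAS, the Python standard library can also sample from a two-parameter Weibull distribution. One caveat: `random.weibullvariate` takes the scale parameter first and the shape parameter second, the reverse of SAS's RAND("Weibull", shape, scale). A sketch:

```python
import random

random.seed(12345)
N = 3000
sample, component = [], []
for _ in range(N):
    if random.random() < 0.667:      # mixing probability for the first component
        component.append(1)
        # NOTE: random.weibullvariate(alpha, beta) takes SCALE first, then SHAPE --
        # the reverse of SAS's RAND("Weibull", shape, scale)
        sample.append(random.weibullvariate(0.8, 1.5))   # Weib(shape=1.5, scale=0.8)
    else:
        component.append(2)
        sample.append(random.weibullvariate(2.0, 4.0))   # Weib(shape=4, scale=2)

p1 = component.count(1) / N   # observed fraction from the first component
```

With 3,000 draws, the observed mixing fraction is close to the 0.667 mixing probability.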

After generating the data, you can call PROC UNIVARIATE to estimate the parameters for each component. Notice that this fits each component separately. If the parameter estimates are close to the parameter values, that is evidence that the simulation generated the data correctly.

/* sample from a mixture of two-parameter Weibull distributions */
%let N = 3000;
data Have(drop=i);
   call streaminit(12345);
   array prob [2] _temporary_ (0.667 0.333);
   do i = 1 to &N;
      component = rand("Table", of prob[*]);
      if component=1 then
         d = rand("weibull", 1.5, 0.8);   /* C=Shape=1.5; Sigma=Scale=0.8 */
      else
         d = rand("weibull", 4, 2);       /* C=Shape=4; Sigma=Scale=2 */
      output;
   end;
run;

proc univariate data=Have;
   class component;
   var d;
   histogram d / weibull NOCURVELEGEND;   /* fit (Sigma, C) for each component */
   ods select Histogram ParameterEstimates Moments;
   ods output ParameterEstimates = UniPE;
   inset weibull(shape scale) / pos=NE;
run;

title "Weibull Estimates for Each Component";
proc print data=UniPE noobs;
   where Parameter in ('Scale', 'Shape');
   var Component Parameter Symbol Estimate;
run;

The graph shows a histogram for data in each component. PROC UNIVARIATE overlays a Weibull density on each histogram, based on the parameter estimates.
The estimates for both components are close to the parameter values. The first component contains 1,970 observations, which is 65.7% of the total sample, so the estimated mixing probabilities are close to the mixing parameters. I used ODS OUTPUT and PROC PRINT to display one table that contains the parameter estimates from the two groups. PROC UNIVARIATE calls the shape parameter *c* and the scale parameter σ.

The PROC UNIVARIATE call uses the Component variable to identify the Weibull distribution to which each observation belongs. If you do not have the Component variable, is it still possible to estimate a two-component Weibull model?

The answer is yes. The FMM procedure fits statistical models for which the distribution of the response is a finite mixture of distributions. In general, the component distributions can be from different families, but this example is a homogeneous mixture, with both components from the Weibull family. When fitting a mixture model, we assume that we do not know which observations belong to which component. We must estimate the mixing probabilities and the parameters for the components. Typically, you need a lot of data and well-separated components for this effort to be successful.

The following call to PROC FMM fits a two-component Weibull model to the simulated data. As shown in a previous article, the estimates from PROC FMM are for the intercept and scale of the error term for a Weibull regression model. These estimates are different from the shape and scale parameters in the Weibull distribution. However, you can transform the regression estimates into the shape and scale parameters, as follows:

title "Weibull Estimates for Mixture";
proc fmm data=Have plots=density;
   model d = / dist=weibull link=log k=2;
   ods select ParameterEstimates MixingProbs DensityPlot;
   ods output ParameterEstimates=PE0;
run;

/* Add the estimates of Weibull scale and shape to the table of regression estimates.
   See https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html */
data FMMPE;
   set PE0(rename=(ILink=WeibScale));
   if Parameter="Scale" then WeibShape = 1/Estimate;
   else WeibShape = ._;   /* ._ is one of the 28 missing values in SAS */
run;

proc print data=FMMPE;
   var Component Parameter Estimate WeibShape WeibScale;
run;

The program renames the ILink column to WeibScale. It also adds a new column (WeibShape) to the ParameterEstimates table. These two columns display the Weibull shape and scale parameter estimates for each component. Despite not knowing which observation came from which component, the procedure provides good estimates for the Weibull parameters. PROC FMM estimates the first component as Weib(α=1.52, β=0.74) and the second component as Weib(α=3.53, β=1.88). It estimates the mixing probability for the first component as 0.6 and for the second component as 0.4.

The PLOTS=DENSITY option on the PROC FMM statement produces a plot of the data and overlays the component and mixture distributions. The plot is shown below and is discussed in the next section.

The PLOTS=DENSITY option produces a graph of the data and overlays the component and mixture distributions. In the graph, the red curve shows the density of the first Weibull component (W1(d)), the green curve shows the density of the second Weibull component (W2(d)), and the blue curve shows the density of the mixture. Technically, only the blue curve is a "true" density that integrates to unity (or 100% on a percent scale). The components are scaled densities. The integral of a component equals the mixing probability, which for these data are 0.6 and 0.4, respectively. The mixture density equals the sum of the component densities.
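You can verify the statement about the areas numerically. The following Python sketch (using the shape, scale, and mixing estimates quoted above) integrates each scaled component and the mixture with a simple Riemann sum:

```python
import math

def weib_pdf(x, shape, scale):
    """Weibull density with the (shape, scale) parameterization used in this article."""
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-((x / scale) ** shape))

# (shape, scale, mixing probability) estimates reported in the article
components = [(1.52, 0.74, 0.6), (3.53, 1.88, 0.4)]

dx = 0.001
areas = []
for shape, scale, prob in components:
    # each SCALED component density integrates to its mixing probability
    area = sum(prob * weib_pdf(i * dx, shape, scale) * dx for i in range(1, 20000))
    areas.append(area)
mixture_area = sum(areas)   # the mixture density integrates to 1
```

The two scaled components integrate to approximately 0.6 and 0.4, and their sum (the mixture density) integrates to approximately 1.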

Look closely at the legend in the plot, which identifies the component curves by the parameter estimates.
Notice that the estimates in the legend are the REGRESSION estimates, not the shape and scale estimates for the Weibull distribution. Do not be misled by the legend. If you plot the PDF

`density = PDF("Weibull", d, 0.74, 0.66); /* WRONG! */ `

you will NOT get the density curve for the first component. Instead, you need to convert the regression estimates into the shape and scale parameters for the Weibull distribution. The following DATA step uses the transformed parameter estimates and demonstrates how to graph the component and mixture densities:

/* plot the Weibull component densities and the mixture density */
data WeibComponents;
   retain d1 d2;
   array WeibScale[2] _temporary_ (0.7351, 1.8820);    /* =exp(Intercept) */
   array WeibShape[2] _temporary_ (1.52207, 3.52965);  /* =1/Scale */
   array MixParm[2]   _temporary_ (0.6, 0.4);
   do d = 0.01, 0.05 to 3.2 by 0.05;
      d1 = MixParm[1]*pdf("Weibull", d, WeibShape[1], WeibScale[1]);
      d2 = MixParm[2]*pdf("Weibull", d, WeibShape[2], WeibScale[2]);
      Component = "Mixture        ";  density = d1+d2;  output;
      Component = "Weib(1.52,0.74)";  density = d1;     output;
      Component = "Weib(3.53,1.88)";  density = d2;     output;
   end;
run;

title "Weibull Mixture Components";
proc sgplot data=WeibComponents;
   series x=d y=density / group=Component;
   keylegend / location=inside position=NE across=1 opaque;
   xaxis values=(0 to 3.2 by 0.2) grid offsetmin=0.05 offsetmax=0.05;
   yaxis grid;
run;

The density curves are the same, but the legend for this graph displays the shape and scale parameters for the Weibull distribution. If you want to reproduce the vertical scale (percent), you can multiply the densities by 100*h*, where *h* = 0.2 is the width of the histogram bins.

In general, be aware that the PLOTS=DENSITY option produces a graph in which the legend labels refer to the REGRESSION parameters. For example, if you use PROC FMM to fit a mixture of normal distributions, the parameter estimates in the legend are for the mean and the VARIANCE of the normal distributions. However, if you intend to use those estimates in other SAS functions (such as PDF, CDF, and RAND), you must take the square root of the variance to obtain the standard deviation.

This article uses PROC FMM to fit a mixture of two Weibull distributions. The article shows how to interpret the parameter estimates from the procedure by transforming them into the shape and scale parameters for the Weibull distribution. The article also emphasizes that when the PLOTS=DENSITY option produces a graph, the legend in the graph contains the regression parameters, which are not the same as the parameters that are used for the PDF, CDF, and RAND functions.

The post Fit a mixture of Weibull distributions in SAS appeared first on The DO Loop.


The relationship between scale and rate parameters is straightforward, but sometimes the relationship between different parameterizations is more complicated. Recently, a SAS programmer was using a regression procedure to fit the parameters of a Weibull distribution. He was confused about how the output from a SAS regression procedure relates to a more familiar parameterization of the Weibull distribution, such as is fit by PROC UNIVARIATE. This article shows how to perform two-parameter Weibull regression in several SAS procedures, including PROC RELIABILITY, PROC LIFEREG, and PROC FMM. The parameter estimates from regression procedures are not the usual Weibull parameters, but you can transform them into the Weibull parameters.

This article fits a two-parameter Weibull model. In a two-parameter model, the threshold parameter is assumed to be 0. A zero threshold assumes that the data can be any positive value.

PROC UNIVARIATE is the first tool to reach for if you want to fit a Weibull distribution in SAS.
The most common parameterization of the Weibull density is

\(f(x; \alpha, \beta) =
\frac{\alpha}{\beta^{\alpha}} x^{\alpha -1} \exp \left(-\left(\frac{x}{\beta}\right)^{\alpha }\right)\)

where
α is a shape parameter and β is a scale parameter. This parameterization is used by most Base SAS functions and procedures, as well as many regression procedures in SAS.
The following SAS DATA step simulates data from the Weibull(α=1.5, β=0.8) distribution and fits the parameters by using PROC UNIVARIATE:

/* sample from a Weibull distribution */
%let N = 100;
data Have(drop=i);
   call streaminit(12345);
   do i = 1 to &N;
      d = rand("Weibull", 1.5, 0.8);   /* Shape=1.5; Scale=0.8 */
      output;
   end;
run;

proc univariate data=Have;
   var d;
   histogram d / weibull endpoints=(0 to 2.5 by 0.25);   /* fit Weib(Sigma, C) to the data */
   probplot / weibull2(C=1.383539 SCALE=0.684287) grid;  /* OPTIONAL: P-P plot */
   ods select Histogram ParameterEstimates ProbPlot;
run;

The histogram of the simulated data is overlaid with a density from the fitted Weibull distribution. The parameter estimates are Shape=1.38 and Scale=0.68, which are close to the parameter values.
PROC UNIVARIATE uses the symbols *c* and σ for the shape and scale parameters, respectively.

The probability-probability (P-P) plot for the Weibull distribution is shown.
In the P-P plot, a reference line is added by using the option `weibull2(C=1.383539 SCALE=0.684287)`. (In practice, you must run the procedure once to get those estimates, then a second time to plot the P-P plot.)
The slope of the reference line is 1/Shape = 0.72 and the intercept of the reference line is log(Scale) = -0.38. Notice that the P-P plot is plotting the quantiles of log(*d*), not of *d* itself.
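The slope and intercept of the reference line follow from the Weibull quantile function, Q(p) = β(−log(1−p))^(1/α): taking logs gives a line in log(−log(1−p)) with slope 1/α and intercept log(β). A short Python sketch confirms the relationship for the estimates above:

```python
import math

shape, scale = 1.383539, 0.684287   # estimates from PROC UNIVARIATE

def log_quantile(p):
    """log of the Weibull quantile: log Q(p) = log(scale) + (1/shape)*log(-log(1-p))"""
    return math.log(scale * (-math.log(1 - p)) ** (1 / shape))

# log Q(p) versus u = log(-log(1-p)) is a line; recover its slope and intercept
p1, p2 = 0.25, 0.75
u1, u2 = math.log(-math.log(1 - p1)), math.log(-math.log(1 - p2))
slope = (log_quantile(p2) - log_quantile(p1)) / (u2 - u1)
intercept = log_quantile(p1) - slope * u1
```

The recovered slope is 1/shape ≈ 0.72 and the intercept is log(scale) ≈ −0.38, the values quoted above.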

It might seem strange to use a regression procedure to fit a univariate distribution, but as I have explained before, there are many situations for which a regression procedure is a good choice for performing a univariate analysis. Several SAS regression procedures can fit Weibull models. In these models, it is usually assumed that the response variable is a time until some event happens (such as failure, death, or occurrence of a disease). The documentation for PROC LIFEREG provides an overview of fitting a model where the logarithm of the random errors follows a Weibull distribution. In this article, we do not use any covariates. We simply model the mean and scale of the response variable.

A problem with using a regression procedure is that a regression model provides estimates for intercepts, slopes, and scales. It is not always intuitive to see how those regression estimates relate to the more familiar parameters for the probability distribution. However, the P-P plot in the previous section shows how intercepts and slopes can be related to parameters of a distribution. The documentation for the LIFEREG procedure states that the Weibull scale parameter is exp(Intercept) and the Weibull shape parameter is the reciprocal of the regression scale parameter.

Notice how confusing this is! For the Weibull distribution, the regression model estimates a SCALE parameter for the error distribution. But the reciprocal of that scale estimate is the Weibull SHAPE parameter, NOT the Weibull scale parameter! (In this article, the response distribution and the error distribution are the same.)

The LIFEREG procedure includes an option to produce a probability-probability (P-P) plot, which is similar to a Q-Q plot. The procedure not only estimates the regression parameters but also provides estimates for the exp(Intercept) and 1/Scale quantities. The following statements use a Weibull regression model to fit the simulated data:

title "Weibull Estimates from LIFEREG Procedure";
proc lifereg data=Have;
   model d = / dist=Weibull;
   probplot;
   inset;
run;

The ParameterEstimates table shows the estimates for the Intercept (-0.38) and Scale (0.72) parameters in the Weibull regression model. We previously saw these numbers as the parameters of the reference line in the P-P plot from PROC UNIVARIATE. Here, they are the result of a maximum likelihood estimate for the regression model. To get from these values to the Weibull parameter estimates, you need to compute Weib_Scale = exp(Intercept) = 0.68 and Weib_Shape = 1/Scale = 1.38. PROC LIFEREG estimates these quantities for you and provides standard errors and confidence intervals.
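The transformation is just arithmetic. A quick Python check, using the rounded regression estimates quoted above:

```python
import math

# regression estimates reported by PROC LIFEREG for these data (rounded)
intercept, reg_scale = -0.38, 0.72

weib_scale = math.exp(intercept)   # exp(Intercept) -> Weibull scale parameter
weib_shape = 1 / reg_scale         # 1/Scale       -> Weibull shape parameter
```

The transformed values (about 0.68 and 1.39) agree with the Weibull estimates from PROC UNIVARIATE, up to rounding.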

The graphical output of the PROBPLOT statement is equivalent to the P-P plot in PROC UNIVARIATE, except that PROC LIFEREG reverses the axes and automatically adds the reference line and a confidence band.

Before ending this article, I want to mention two other regression procedures that perform similar computations: PROC RELIABILITY, which is in SAS/QC software, and PROC FMM in SAS/STAT software.

The following statements call PROC RELIABILITY to fit a regression model to the simulated data:

title "Weibull Estimates from RELIABILITY Procedure";
proc reliability data=Have;
   distribution Weibull;
   model d = ;
run;

The parameter estimates are similar to the estimates from PROC LIFEREG. The output also includes an estimate of the Weibull shape parameter, which is 1/EV_Scale. The output does not include an estimate for the Weibull scale parameter, which is exp(Intercept).

In a similar way, you can use PROC FMM to fit a Weibull model. PROC FMM is typically used to fit a mixture distribution, but you can specify the K=1 option to fit a single response distribution, as follows:

title "Weibull Estimates from FMM Procedure";
proc fmm data=Have;
   model d = / dist=weibull link=log k=1;
   ods select ParameterEstimates;
run;

The ParameterEstimates table shows the estimates for the Intercept (-0.38) and Scale (0.72) parameters in the Weibull regression model. The Weibull scale parameter is shown in the column labeled "Inverse Linked Estimate." (The model uses a LOG link, so the inverse link is EXP.) There is no estimate for the Weibull shape parameter, which is the reciprocal of the Scale estimate.

The easiest way to fit a Weibull distribution to univariate data is to use the UNIVARIATE procedure in Base SAS. The Weibull shape and scale parameters are directly estimated by that procedure. However, you can also fit a Weibull model by using a SAS regression procedure. If you do this, the regression parameters are the Intercept and the scale of the error distribution. You can transform these estimates into estimates for the Weibull shape and scale parameters. This article shows the output (and how to interpret it) for several SAS procedures that can fit a Weibull regression model.

Why would you want to use a regression procedure instead of PROC UNIVARIATE? One reason is that the response variable (failure or survival) might depend on additional covariates. A regression model enables you to account for additional covariates and still understand the underlying distribution of the random errors. A second reason is that the FMM procedure can fit a mixture of distributions. To make sense of the results, you must be able to interpret the regression output in terms of the usual parameters for the probability distributions. In a second article, I show how to fit a mixture of Weibull distributions.

The post Interpret estimates for a Weibull regression model in SAS appeared first on The DO Loop.

The post An introduction to genetic algorithms in SAS appeared first on The DO Loop.

A previous article discusses the mutation and crossover operators, which are important in implementing a genetic algorithm. In previous articles, the solution vectors were represented by column vectors. In the GA routines, candidates are *row* vectors. The population is a matrix, where each row represents an individual.

The *SAS/IML User's Guide* provides an overview of genetic algorithms. The five main steps follow:

- **Encoding**: Each potential solution is represented as a *chromosome*, which is a vector of values. For the knapsack problem, each chromosome is an N-dimensional vector of binary values.
- **Fitness**: Choose a function to assess the fitness of each candidate chromosome. This is usually the objective function for unconstrained problems, or a penalized objective function for problems that have constraints. The fitness of a candidate determines the probability that it will contribute its genes to the next generation of candidates.
- **Selection**: Choose which candidates become parents to the next generation of candidates.
- **Crossover (Reproduction)**: Choose how to produce children from parents.
- **Mutation**: Choose how to randomly mutate some children to introduce additional diversity.
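To see the five steps end-to-end outside of SAS/IML, here is a minimal pure-Python sketch applied to the same knapsack data that is used later in this article. The operator choices (uniform crossover, independent bit-flip mutation, a dual tournament with elitism) mirror the spirit of the IML program but are illustrative, not a translation of the GA routines:

```python
import random

random.seed(1)

# knapsack data from this article
weight = [2, 3, 4, 4, 1.5, 1.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
value  = [6, 6, 6, 5, 3.1, 3.0, 1.5, 1.3, 1.2, 1.1, 1.0, 1.1, 1.0, 1.0, 0.9, 0.8, 0.6]
wt_limit, lam = 9, 100          # lam penalizes infeasible candidates
N = len(weight)

def fitness(b):
    """Penalized objective: total value, minus a penalty if over the weight limit."""
    w = sum(wi * bi for wi, bi in zip(weight, b))
    v = sum(vi * bi for vi, bi in zip(value, b))
    return v - lam * (w - wt_limit) ** 2 if w > wt_limit else v

def mutate(b, prob=0.2):
    """Flip each site independently with a small probability."""
    return [bi ^ (random.random() < prob) for bi in b]

def crossover(p1, p2):
    """Uniform crossover: each child site comes from one parent or the other."""
    mask = [random.random() < 0.5 for _ in range(N)]
    c1 = [a if m else b for a, b, m in zip(p1, p2, mask)]
    c2 = [b if m else a for a, b, m in zip(p1, p2, mask)]
    return c1, c2

def tournament(pop, p_best=0.95):
    """Dual tournament: the fitter of two random candidates wins with high probability."""
    a, b = random.sample(pop, 2)
    best, worst = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return best if random.random() < p_best else worst

pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(100)]
for gen in range(50):
    pop.sort(key=fitness, reverse=True)
    nxt = [row[:] for row in pop[:3]]        # clone 3 elites unchanged
    while len(nxt) < len(pop):
        c1, c2 = crossover(tournament(pop), tournament(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt[:len(pop)]

best = max(pop, key=fitness)
best_weight = sum(wi * bi for wi, bi in zip(weight, best))
best_value = sum(vi * bi for vi, bi in zip(value, best))
```

Because the elites are cloned unchanged into each generation, the best fitness never decreases, and after a few dozen generations the best candidate is feasible and close to the optimal value.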

Each potential solution is represented as an N-dimensional vector of values. For the knapsack problem, you can choose a binary vector. In SAS/IML, you use the GASETUP function to define the encoding for a problem. The SAS/IML language supports four different encodings. The knapsack problem can be encoded by using an integer vector, as follows:

/* Individuals are ROW vectors. Population is a matrix of stacked rows. */
proc iml;
call randseed(12345);
/* Solve the knapsack problem: max Value*b subject to Weight*b <= WtLimit */
/* Item: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 */
Weight = {2 3 4 4 1.5 1.5 1 1 1 1 1 1 1 1 1 1 1}`;
Value  = {6 6 6 5 3.1 3.0 1.5 1.3 1.2 1.1 1.0 1.1 1.0 1.0 0.9 0.8 0.6}`;
WtLimit = 9;                 /* weight limit */
N = nrow(Weight);

/* set up an encoding for the GA */
id = gaSetup(2,              /* 2-> integer vector encoding */
             nrow(weight),   /* size of vector */
             123);           /* internal seed for GA */

The GASETUP function returns an identifier for the problem. This identifier must be used as the first argument to subsequent calls to GA routines. It is possible to have a program that runs several GAs, each with its own identifier. The GASETUP call specifies that candidate vectors are integer vectors. Later in the program, you can tell the GA that they are binary vectors.

In previous articles, I discussed fitness, mutation, and crossover functions:

- You can use a penalized objective function to evaluate the fitness of a candidate solution. In the SAS/IML language, use the GASETOBJ subroutine to register the objective function and to specify whether the GA should minimize or maximize the function.
- You can define a mutation subroutine to control how candidates get mutated. You can register the mutation module by using the GASETMUT subroutine. When you call the GASETMUT subroutine, you also need to specify the probability that a candidate will be mutated.
- You can define a crossover subroutine to control how two parents produce two children. You can register the crossover module by using the GASETCRO subroutine. When you call the GASETCRO subroutine, you also need to specify the probability that a fit candidate in the current generation is chosen to reproduce and thus contribute its characteristics to the next generation.

The following SAS/IML statements define the fitness module (ObjFun), the mutation module (Mutate), and the crossover module (Cross) and register these modules with the GA system.

/* just a few of the many hyperparameters */
lambda = 100;      /* factor to penalize exceeding the weight limit */
ProbMut = 0.2;     /* probability that the i_th site is mutated */
ProbCross = 0.3;   /* children based on 30%-70% split of parents */

/* b is a binary column vector */
start ObjFun( b ) global(Weight, Value, WtLimit, lambda);
   wsum = b * Weight;
   val = b * Value;
   if wsum > WtLimit then                       /* penalize if weight limit exceeded */
      val = val - lambda*(wsum - WtLimit)##2;   /* subtract b/c we want to maximize value */
   return(val);
finish;

/* Mutation operator for a binary vector, b. */
start Mutate(b) global(ProbMut);
   N = ncol(b);
   k = max(1, randfun(1, "Binomial", ProbMut, N));   /* how many sites? */
   j = sample(1:N, k, "NoReplace");                  /* choose random elements */
   b[j] = ^b[j];                                     /* mutate these sites */
finish;

/* Crossover operator for a pair of parents. */
start Cross(child1, child2, parent1, parent2) global(ProbCross);
   b = j(ncol(parent1), 1);
   call randgen(b, "Bernoulli", ProbCross);   /* 0/1 vector */
   idx = loc(b=1);                            /* locations to cross */
   child1 = parent1;  child2 = parent2;
   if ncol(idx)>0 then do;                    /* exchange values */
      child1[idx] = parent2[idx];
      child2[idx] = parent1[idx];
   end;
finish;

/* register these modules so the GA can call them as needed */
call gaSetObj(id, 1, "ObjFun");   /* 1->maximize objective module */
call gaSetCro(id, 1.0,            /* hyperparameter: crossover probability */
              0, "Cross");        /* user-defined crossover module */
call gaSetMut(id, 0.20,           /* hyperparameter: mutation probability */
              0, "Mutate");       /* user-defined mutation module */

A genetic algorithm pits candidates against each other according to Darwin's observation that individuals who are fit are more likely to pass on their characteristics to the next generation.

A genetic algorithm evolves the population across many generations. The individuals who are more fit are likely to "reproduce" and send their progeny to the next round.
To preserve the characteristics of the very best individuals (called *elites*), some
individuals are "cloned" and passed unchanged to the next generations.
After many rounds, the population is fitter, on average, and the elite individuals are the best solutions to the objective function.

In SAS/IML, you can use the GASETSEL subroutine to specify the rules for selecting elite individuals and for selecting which individuals are eligible to reproduce. The following call specifies that 3 elite individuals are "cloned" for the next generation. For the remaining individuals, pairs are selected at random. With 95% probability, the more fit individual is selected to reproduce:

call gaSetSel(id, 3,   /* hyperparameter: carry k elites directly to next generation */
              1,       /* dual tournament */
              0.95);   /* best-player-wins probability */
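To see what the dual-tournament rule does, here is a small Python sketch (illustrative, not the IML implementation): two candidates are drawn at random, and the more fit one wins with the best-player-wins probability.

```python
import random

random.seed(0)

def best_player_wins(fit_a, fit_b, p=0.95):
    """Dual tournament: return the more fit of two candidates with probability p."""
    better, worse = (fit_a, fit_b) if fit_a >= fit_b else (fit_b, fit_a)
    return better if random.random() < p else worse

# over many tournaments, the more fit candidate should win about 95% of the time
trials = 20000
wins = sum(best_player_wins(10.0, 1.0) == 10.0 for _ in range(trials))
win_rate = wins / trials
```

The empirical win rate is close to the 0.95 best-player-wins probability; the occasional win by the weaker candidate is what preserves genetic diversity.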

At this point, the GA problem is fully defined.

You can use the GAINIT subroutine to generate the initial population. Typically, the initial population is generated randomly. If there are bounds constraints on the solution vector, you can specify them. For example, the following call generates a very small population of 10 random individuals. The chromosomes are constrained to be binary by specifying a lower bound of 0 and an upper bound of 1 for each element of the integer vector.

/* example of very small population; only 10 candidates */
call gaInit(id, 10,                  /* initial population size */
            j(1, nrow(weight), 0) // /* lower and upper bounds of binary vector */
            j(1, nrow(weight), 1) );

You can evolve the population by calling the GAREGEN subroutine. The GAREGEN call selects individuals to reproduce according to the tournament rules. The selected individuals become parents. Pairs of parents produce children according to the crossover operation. Some children are mutated according to the mutation operator. The children become the next generation and replace the parents as the "current" population.

At any time, you can use the GAGETMEM subroutine to obtain the members of the current population. You can use the GAGETVAL subroutine to obtain the fitness scores of the current population. Let's manually call the GAREGEN subroutine a few times and use the GAGETVAL subroutine after each call to see how the population evolves:

call gaGetVal(f0, id);   /* initial generation is random */
call gaRegen(id);        /* create the next generation via selection, crossover, and mutation */
call gaGetVal(f1, id);   /* evaluate fitness */
call gaRegen(id);        /* continue for additional generations ... */
call gaGetVal(f2, id);

/* print fitness for each generation and top candidates so far */
print f0[L="Initial Fitness"] f1[L="1st Gen Fitness"] f2[L="2nd Gen Fitness"];
call gaGetMem(best, val, id, 1:3);
print best[f=1.0 L="Best Members"], val[L="Best Values"];

The output shows how the population evolves for three generations. Initially, only one member of the population satisfies the weight constraints of the knapsack problem. The feasible solution has a positive fitness score (9.7); the infeasible solutions are negative for this problem. After the selection, crossover, and mutation operations, the next generation has two feasible solutions (9.7 and 8.1). After another round, the population has six feasible solutions and the score for the best solution has increased to 11.7. The population is becoming more fit, on average.

After the second call to GAREGEN, the GAGETMEM call gets the candidates in positions 1:3. Recall that you specified three "elite" members in the GASETSEL call. The elite members are therefore placed at the top of the population. (The remaining individuals are not sorted according to fitness.) The chromosomes for the elite members are shown. In this encoding, each chromosome is a 0/1 binary vector that determines which objects are placed in the knapsack and which are left out.

The previous section purposely used a small population of 10 individuals. Such a small population lacks genetic diversity and might not converge to an optimal solution (or at least not quickly). A more reasonable population contains 100 or more individuals. You can use the GAINIT call a second time to reinitialize the initial population. Now that you have experience with the GAREGEN and GAGETVAL calls, you can use those calls in a DO loop to iterate over many generations. The following statements iterate the initial population through 15 generations. For each generation, the program records the best fitness score (f[1]) and the median fitness score.

/* for genetic diversity, better to have a larger population */
call gaInit(id, 100,                 /* initial population size */
            j(1, nrow(weight), 0) // /* lower and upper bounds of binary vector */
            j(1, nrow(weight), 1) );

/* record the best and median scores for each generation */
niter = 15;
summary = j(niter,3);
summary[,1] = t(1:niter);            /* (Iteration, Best Value, Median Value) */
do i = 1 to niter;
   call gaRegen(id);
   call gaGetVal(f, id);
   summary[i,2] = f[1];
   summary[i,3] = median(f);
end;
print summary[c = {"Iteration" "Best Value" "Median Value"}];

The output shows that the fitness of the best candidate increases monotonically, from an initial value of 16.3 to the final value of 19.6, which is the optimal value. The median value also tends to increase, although, because of the random mutation and crossover operations, statistics such as the median or mean are not guaranteed to be monotonic. Nevertheless, the fitness of later generations tends to be better than the fitness of earlier generations.

You can examine the chromosomes of the elite candidates in the final generation by using the GAGETMEM subroutine:

```sas
/* print the top candidates */
call gaGetMem(best, f, id, 1:3);
print best[f=1.0 L="Best Member"], f[L="Final Best Value"];
```

The output confirms that the best candidate is the same binary vector as was found by using a constrained linear program in a previous article.

The GA algorithm maintains an internal state, which enables you to continue iterating if the current generation is not satisfactory. In this case, the problem is solved, so you can use the GAEND subroutine to release the internal memory and resources that are associated with this GA. After you call GAEND, the identifier becomes invalid.

```sas
call gaEnd(id);   /* free the memory and internal resources for this GA */
```

As with any tool, it is important to recognize that a GA has strengths and weaknesses. Strengths include:

- A GA is amazingly flexible. It can be used to solve a wide variety of optimization problems.
- A GA can provide useful suboptimal solutions. The elite members of a population might be "good enough," even if they are not optimal.

Weaknesses include:

- A GA is dependent on random operations. If you change the random number seed, you might obtain a completely different solution or no solution at all.
- A GA can take a long time to produce an optimal solution. It does not tell you whether a candidate is optimal, only that it is the "most fit" so far.
- A GA requires many heuristic choices. It is not always clear how to implement the mutation and crossover operators or how to implement the tournament that selects individuals to be parents.

In summary, this article shows how to use low-level routines in SAS/IML software to implement a genetic algorithm. Genetic algorithms can solve optimization problems that are intractable for traditional mathematical optimization algorithms. Like all tools, a GA has strengths and weaknesses. By gaining experience with GAs, you can build intuition about when and how to apply this powerful method.

For other ways to use genetic algorithms in SAS, see the GA procedure in SAS/OR software and the black-box solver in PROC OPTMODEL.

The post An introduction to genetic algorithms in SAS appeared first on The DO Loop.

]]>Some programmers love using genetic algorithms. Genetic algorithms are heuristic methods that can be used to solve problems that are difficult to solve by using standard discrete or calculus-based optimization methods. A genetic algorithm tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The better candidates are combined to create a new set of candidates. Some of the new candidates experience mutations. Eventually, over many generations, a GA can produce candidates that approximately solve the optimization problem. Randomness plays an important role. Re-running a GA with a different random number seed might produce a different solution.

Critics of genetic algorithms note two weaknesses of the method. First, you are not guaranteed to get the optimal solution. However, in practice, GAs often find an acceptable solution that is good enough to be used. The second complaint is that the user must make many heuristic choices about how to implement the GA. Critics correctly note that implementing a genetic algorithm is as much an art as it is a science. You must choose values for hyperparameters and define operators that are often based on a "feeling" that these choices might result in an acceptable solution.

This article discusses two fundamental parts of a genetic algorithm: the crossover and the mutation operators. The operations are discussed by using the binary knapsack problem as an example. In the knapsack problem, a knapsack can hold W kilograms. There are N objects, each with a different value and weight. You want to maximize the value of the objects you put into the knapsack without exceeding the weight.
A solution to the knapsack problem is a 0/1 binary vector **b**. If **b**[i]=1, the i_th object is in the knapsack; if **b**[i]=0, it is not.

The *SAS/IML User's Guide* provides an overview of genetic algorithms. The main steps in a genetic algorithm are as follows:

- **Encoding**: Each potential solution is represented as a *chromosome*, which is a vector of values. The values can be binary, integer-valued, or real-valued. (The values are sometimes called genes.) For the knapsack problem, each chromosome is an N-dimensional vector of binary values.
- **Fitness**: Choose a function to assess the fitness of each candidate chromosome. This is usually the objective function for unconstrained problems, or a penalized objective function for problems that have constraints. The fitness of a candidate determines the probability that it will contribute its genes to the next generation of candidates.
- **Selection**: Choose which candidates become parents to the next generation of candidates.
- **Crossover (Reproduction)**: Choose how to produce children from parents.
- **Mutation**: Choose how to randomly mutate some children to introduce additional diversity.
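The steps above can be sketched as a single loop. The following Python sketch is illustrative only: the function names, the tournament size, and the hyperparameter values are arbitrary choices for the example, not the SAS/IML GA routines. It solves a toy "one-max" problem, in which the fitness of a 0/1 chromosome is simply the number of 1s it contains.

```python
import random

random.seed(12345)

def run_ga(fitness, n_genes, pop_size=100, n_gen=50,
           p_cross=0.3, p_mut=0.05, n_elite=3):
    """Sketch of a generic GA for binary chromosomes (illustrative only)."""
    # Encoding: each chromosome is a list of 0/1 genes
    pop = [[random.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(n_gen):
        # Fitness: sort so the most fit candidates are at the top
        pop.sort(key=fitness, reverse=True)
        next_pop = [p[:] for p in pop[:n_elite]]     # elitism
        while len(next_pop) < pop_size:
            # Selection: tournament of size 2
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            # Crossover: uniform exchange of genes between the parents
            child = [g2 if random.random() < p_cross else g1
                     for g1, g2 in zip(p1, p2)]
            # Mutation: flip each gene with a small probability
            child = [1 - g if random.random() < p_mut else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# toy "one-max" problem: the fitness of a chromosome is its sum
best = run_ga(sum, n_genes=10, pop_size=30, n_gen=40)
print(best, sum(best))
```

Because the elite members are copied unchanged into each new generation, the best fitness in this sketch never decreases from one generation to the next.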

This article discusses the crossover and the mutation operators.

The mutation operator is the easiest operation to understand. In each generation, some candidates are randomly perturbed. By chance, some of the mutations might be beneficial and make the candidate more fit. Others are detrimental and make the candidate less fit.

For a binary chromosome, a mutation consists of changing the parity of some proportion of the elements.
The simplest mutation operation is to always change *k* random elements for some hyperparameter *k* < N.
A more realistic mutation operation is to choose the number of sites randomly according to a binomial probability distribution with hyperparameter *p*_{mut}. Then *k* is a random variable that differs for each mutation operation.
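The two strategies can be sketched in a few lines of Python. This is an illustrative sketch (the function names are assumptions), not the SAS/IML subroutine that follows: one function always flips exactly *k* sites, the other draws *k* from a binomial distribution but flips at least one site.

```python
import random

random.seed(12345)

def mutate_fixed_k(b, k):
    """Always flip exactly k random sites of the 0/1 list b (in place)."""
    for j in random.sample(range(len(b)), k):
        b[j] = 1 - b[j]

def mutate_binomial(b, p_mut):
    """Flip k sites, where k ~ Binom(len(b), p_mut), but not less than 1."""
    n = len(b)
    k = max(1, sum(random.random() < p_mut for _ in range(n)))  # binomial draw
    for j in random.sample(range(n), k):
        b[j] = 1 - b[j]

b = [0]*4 + [1]*8 + [0]*5     # N=17; items 5-12 are in the knapsack
orig = b[:]
mutate_binomial(b, 0.2)
n_changed = sum(x != y for x, y in zip(orig, b))
print(n_changed)
```

Because the sites are sampled without replacement, exactly *k* elements change parity in each call.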

The following SAS/IML program
chooses *p*_{mut}=0.2 and defines a subroutine that mutates a binary vector, **b**. In this example, there are N=17 items that you can put into the knapsack. The subroutine first uses the Binom(*p*_{mut}, N) probability distribution to obtain a random number of sites, *k*, to mutate. (But if the distribution returns 0, set *k*=1.)
The SAMPLE function then draws *k* random positions (without replacement),
and the values in those positions are changed.

```sas
proc iml;
call randseed(12345);
N = 17;           /* size of binary vector */
ProbMut = 0.2;    /* mutation in 20% of sites */

/* Mutation operator for a binary vector, b.
   The number of mutation sites k ~ Binom(ProbMut, N), but not less than 1.
   Randomly sample (without replacement) k sites.
   If an item is not in knapsack, put it in; if an item is in the sack, take it out. */
start Mutate(b) global(ProbMut);
   N = nrow(b);
   k = max(1, randfun(1, "Binomial", ProbMut, N));  /* how many sites? */
   j = sample(1:N, k, "NoReplace");                 /* choose random elements */
   b[j] = ^b[j];                                    /* mutate these sites */
finish;

Items = 5:12;                 /* choose items 5-12 */
b = j(N,1,0);  b[Items] = 1;
bOrig = b;
run Mutate(b);
print (bOrig`)[L="Original b" c=(1:N)] (b`)[L="Randomly Mutated b" c=(1:N)];
```

In this example, the original chromosome has a 1 in locations 5:12. The binomial distribution randomly decides to mutate *k*=4 sites. The SAMPLE function randomly chooses the locations 3, 11, 15, and 17. The parity of those sites is changed. This is seen in the output, which shows that the parity of these four sites differs between the original and the mutated **b** vector.

Notice that you must choose HOW the mutation operator works, and you must choose a hyperparameter that determines how many sites get mutated. The best choices depend on the problem you are trying to solve. Typically, you should choose a small value for the probability *p*_{mut} so that only a few sites are mutated.

In the SAS/IML language, there are several built-in mutation operations that you can use. They are discussed in the documentation for the GASETMUT subroutine.

The crossover operator is analogous to the creation of offspring through sexual reproduction. You, as the programmer, must decide how the parent chromosomes, p1 and p2, will combine to create two children, c1 and c2. There are many choices you can make. Some reasonable choices include:

- Randomly choose a location *s*, 1 ≤ *s* ≤ N. Split the parent chromosomes at that location and exchange and combine the left and right portions of the parents' chromosomes. One child chromosome is `c1 = p1[1:s] // p2[s+1:N]` and the other is `c2 = p2[1:s] // p1[s+1:N]`. Note that each child gets some values ("genes") from each parent.
- Randomly choose a location *s*, 1 ≤ *s* ≤ N. Divide the first chromosome into subvectors of length *s* and N-*s*. Divide the second chromosome into subvectors of length N-*s* and *s*. Exchange the subvectors of the same length to form the child chromosomes.
- Randomly choose *k* locations. Exchange the values at those locations between the parents to form the child chromosomes.
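For concreteness, the first method (single-point crossover) can be sketched in Python. This is an illustrative sketch, not the article's SAS/IML implementation; it restricts the split site to 1 ≤ *s* ≤ N-1 (a design choice) so that each child always receives genes from both parents.

```python
import random

random.seed(12345)

def single_point_cross(p1, p2):
    """Single-point crossover: choose a random split site s
    (restricted to 1 <= s <= N-1 so each child gets genes from
    both parents) and swap the tails of the two chromosomes."""
    n = len(p1)
    s = random.randint(1, n - 1)       # crossover site
    c1 = p1[:s] + p2[s:]               # c1 = p1[1:s] // p2[s+1:N]
    c2 = p2[:s] + p1[s:]               # c2 = p2[1:s] // p1[s+1:N]
    return c1, c2

p1 = [1]*8 + [0]*9    # parent 1 (N=17)
p2 = [0]*9 + [1]*8    # parent 2
c1, c2 = single_point_cross(p1, p2)
print(c1)
print(c2)
```

Notice that crossover only rearranges genes: at every position, the multiset of values in the two children equals the multiset of values in the two parents.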

The following SAS/IML function implements the third crossover method.
The method uses a hyperparameter, *p*_{cross}, which is the probability that each location in the chromosome is selected. On average, about N*p*_{cross} locations will be selected. In the following program, *p*_{cross} = 0.3, so we expect 17(0.3)=5.1 values to be exchanged between the parent chromosomes to form the children:

```sas
start uniform_cross(child1, child2, parent1, parent2) global(ProbCross);
   b = j(nrow(parent1), 1);
   call randgen(b, "Bernoulli", ProbCross);   /* 0/1 vector */
   idx = loc(b=1);                            /* locations to cross */
   child1 = parent1;                          /* initialize children */
   child2 = parent2;
   if ncol(idx)>0 then do;                    /* exchange values */
      child1[idx] = parent2[idx];             /* child1 gets some from parent2 */
      child2[idx] = parent1[idx];             /* child2 gets some from parent1 */
   end;
finish;

ProbCross = 0.3;   /* crossover 30% of sites */
Items = 5:12;   p1 = j(N,1,0);  p1[Items] = 1;   /* choose items 5-12 */
Items = 10:15;  p2 = j(N,1,0);  p2[Items] = 1;   /* choose items 10-15 */
run uniform_cross(c1, c2, p1, p2);
print (p1`)[L="Parent1" c=(1:N)],
      (p2`)[L="Parent2" c=(1:N)],
      (c1`)[L="Child1" c=(1:N)],
      (c2`)[L="Child2" c=(1:N)];
```

I augmented the output to show how the child chromosomes are created from their parents. For this run, the selected locations are 1, 8, 10, 12, and 14. The first child gets all values from the first parent except for the values in these five positions, which are from the second parent. The second child is formed similarly.

When the parent chromosomes resemble each other, the children will resemble the parents. However, if the parent chromosomes are very different, the children might not look like either parent.

Notice that you must choose HOW the crossover operator works, and you must choose a hyperparameter that determines how to split the parent chromosomes. In more sophisticated crossover operations, there might be additional hyperparameters, such as the probability that a subchromosome from a parent gets reversed in the child. There are many heuristic choices to make, and the best choice is not knowable.

In the SAS/IML language, there are many built-in crossover operations that you can use. They are discussed in the documentation for the GASETCRO subroutine.

Genetic algorithms can solve optimization problems that are intractable for traditional mathematical optimization algorithms. But the power comes at a cost. The user must make many heuristic choices about how the GA should work. The user must choose hyperparameters that control the probability that certain events happen during mutation and crossover operations. The algorithm uses random numbers to generate new potential solutions from previous candidates. This article used the SAS/IML language to discuss some of the choices that are required to implement these operations.

A subsequent article discusses how to implement a genetic algorithm in SAS.

The post Crossover and mutation: An introduction to two operations in genetic algorithms appeared first on The DO Loop.

]]>