PUT it there! Six tips for using PUT and %PUT statements in SAS

For SAS programmers, the PUT statement in the DATA step and the %PUT macro statement are useful statements that enable you to display the values of variables and macro variables, respectively. By default, the output appears in the SAS log. This article shares a few tips that help you to use these statements more effectively.

Tip 1: Display the name and value of a variable

The PUT statement supports a "named output" syntax that enables you to easily display a variable name and value. The trick is to put an equal sign immediately after the name of a variable: PUT varname=; For example, the following statement displays the text "z=" followed by the value of z:

data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
put z=;           /* display variable and value */
run;
z=10.9

Tip 2: Display values of arrays

You can extend the previous tip to arrays and to sets of variables. The PUT statement enables you to display elements of an array (or multiple variables) by specifying the array name in parentheses, followed by an equal sign in parentheses, as follows:

data _null_;
array x[5];
do k = 1 to dim(x);
   x[k] = k**2;
end;
put (x[*]) (=);     /* put each element of array on separate lines */
put (x1 x3 x5) (=); /* put each variable/value on separate lines */
run;
x1=1 x2=4 x3=9 x4=16 x5=25
x1=1 x3=9 x5=25

This syntax is not supported for _TEMPORARY_ arrays. However, as a workaraound, you can use the CATQ function to concatenate array values into a character variable, as follows:

temp = catq('d', ',', of x[*]);         /* x can be _TEMPORARY_ array */
put temp=;

Incidentally, if you ever want to apply a format to the values, the format name goes inside the second set of parentheses, after the equal sign: put (x1 x3 x5) (=6.2);

Tip 3: Display values on separate lines

The previous tip displayed all values on a single line. Sometimes it is useful to display each value on its own line. To do that, put a slash after the equal sign, as follows:

...
put (x[*]) (=/);                   /* put each element on separate lines */
...
x1=1
x2=4
x3=9
x4=16
x5=25

Tip 4: Display all name-value pairs

You can display all values of all variables by using the _ALL_ keyword, as follows:

data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
A = "SAS"; B = "Statistics";
put _ALL_;              /* display all variables and values */
run;
x=9.1 y=6 z=10.9 A=SAS B=Statistics _ERROR_=0 _N_=1

Notice that in addition to the user-defined variables, the _ALL_ keyword also prints the values of two automatic variables named _ERROR_ and _N_.

Tip 5: Display the name and value of a macro variable

Just as the PUT statement displays the value of an ordinary variable, you can use the %PUT statement to display the value of a macro variable. If you use the special "&=" syntax, SAS will display the name and value of a macro variable. For example, to display your SAS version, you can display the value of the SYSVLONG automatic system macro variable, as follows:

%put &=SYSVLONG;
SYSVLONG=9.04.01M4P110916

The results above are for my system, which is running SAS 9.4M4. Your SAS version might be different.

Tip 6: Display all name-value pairs for macros

You can display the name and value of all user-defined macros by using the _USER_ keyword. You can display the values of all SAS automatic system macros by using the _AUTOMATIC_ keyword.

%let N = 50;
%let NumSamples = 1e4;
%put _USER_;
GLOBAL N 50
GLOBAL NUMSAMPLES 1e4

Conclusion and References

There you have it: six tips to make it easier to display the value of SAS variables and macro variables. Thanks to Jiangtang Hu who pointed out the %PUT &=var syntax in his blog in 2012. For additional features of the PUT and %PUT statements, see:

Post a Comment

Ten posts from 2016 that deserve a second look

Last week I wrote about the 10 most popular articles from The DO Loop in 2016. The popular articles tend to be about elementary topics that appeal to a wide range of SAS programmers. Today I present an "editor's choice" list of technical articles that describe more advanced statistical methods in SAS.

I've grouped the articles into three categories: statistical graphics and visualization, statistical computations, and matrix computations. If you are a SAS statistical programmer, these articles deserve a second look.

Ten posts from The DO Loop that deserve a second look #SASTip Click To Tweet

Statistical graphics and visualization

An effect plot

SAS ODS graphics provides an easy way to create standard graphs for data analysis. The graphs in this list are more sophisticated:

Statistical computations

A nearest neighbor plot

These article show helpful statistical techniques that you should know about:

Matrix computations

markovabsorbing1

The SAS DATA step is awesome. For many programming tasks, it is an efficient and effective tool. However, advanced analytical algorithms and multivariate statistics often require matrix-vector computations, which means programming in the SAS/IML language.

There you have it, 10 articles from The DO Loop in 2016 that I think are worth a second look. Did I omit your favorite article? Leave a comment.

Post a Comment

ODS OUTPUT: Store any statistic created by any SAS procedure

In the beginning SAS created procedures and output. The output was formless and void. Then SAS said, "Let there be ODS," and there was ODS. Customers saw that ODS was good, and SAS separated the computation from the display and management of output.

The preceding paragraph oversimplifies the SAS Output Delivery System (ODS), but the truth is that ODS is a powerful feature of SAS. You can use ODS to send SAS tables and graphics to various output destinations, including HTML, PDF, RTF, and PowerPoint. You can control the style and attributes of the output, thus creating a customized report. There have been hundreds of papers and books written about ODS. A very basic introduction is Olinger (2000) "ODS for Dummies."

To a statistical programmer the most useful destination is the OUTPUT destination. The OUTPUT destination sends a table or graph to a SAS data set. Consequently, you can programmatically access each element of the output.

The implications of the previous statement are monumental. I cannot overstate the importance of the OUTPUT destination, so let me say it again:

The ODS OUTPUT destination enables you to store any value that is produced by any SAS procedure. You can then read that value by using a SAS program.

The ODS OUTPUT destination answers a common question that is asked by new programmers on SAS discussion forums: "How can I get a statistic into a data set or into a macro variable?" The steps are as follows:

  1. Use ODS TRACE ON (or the SAS documentation) to find the name of the ODS table that contains the statistic that you want.
  2. Use the ODS OUTPUT statement to specify the table name and a data set name. The syntax is ODS OUTPUT TableName=DataSetName. Then run the procedure to generate the table.
  3. Read the data set to obtain the value of the statistic.
New to #SAS programming? How to get any statistic into a data set. Click To Tweet

Find the name of the ODS table

As an example, suppose that you intend to use PROC REG to perform a linear regression, and you want to capture the R-square value in a SAS data set. The documentation for the procedure lists all ODS tables that the procedure can create, or you can use the ODS TRACE ON statement to display the table names that are produced by PROC REG. The data are the 428 vehicles in the Sashelp.Cars data set, which is distributed with SAS:

ods trace on;                           /* write ODS table names to log */
proc reg data=Sashelp.Cars plots=none;
   model Horsepower = EngineSize Weight;
quit;
ods trace off;                          /* stop writing to log */
odsoutput1
Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------

By looking at the output, you can see that the third table contains the R-square value. By looking at the SAS log, you can see that the name of the third table is "FitStatistics."

Save the table to a SAS data set

Now that you know the name of the ODS table is "FitStatistics," use the ODS OUTPUT destination to write that table to a SAS data set, as follows:

ods output FitStatistics=Output;        /* the data set name is 'Output' */
proc reg data=Sashelp.Cars plots=none;  /* same procedure call */
   model Horsepower = EngineSize Weight;
quit;
 
proc print data=Output noobs;
run;
odsoutput2

The output from PROC PRINT shows the structure of the output data set. Notice that the data set often looks different from the original displayed table. The data set contains non-printing columns (like Model and Dependent) that do not appear in the displayed table. The data set also contains columns that contain the raw numerical values and the (formatted) character values of the statistics. The columns cValue1 and nValue1 represent the same information, except that the cValue1 is a character column whereas nValue1 is a numerical column. The same applies to the cValue2 and nValue2 columns. The character values might contain formatted or rounded values.

Read the value of the statistic into a macro variable

From the previous PROC PRINT output, you can see that the numerical value of the R-square statistic is in the first row and is in the nValue2 column. You can therefore read and process that value by using a standard WHERE clause. For example, the following statements use the SYMPUTX subroutine to create a macro variable that contains the value of the R-square statistic:

data _null_;
set Output;
if Label2="R-Square" then call symputx("RSq", nValue2);
run;
 
%put RSq = &RSq;
RSq = 0.6201360929

The SAS log shows that the R-square value is now contained in the Rsq macro variable.

Storing the statistic in a macro variable is only one way to use the data set. You could also read the statistics into PROC IML or PROC SQL for further computation, or show the value of the statistic in a graph.

BY-group processing: Multiple samples and multiple statistics

The previous sections show how to save a single table to a SAS data set. It is just as easy to create a data set that contains multiple statistics, one for each level in a BY-group analysis.

Suppose that you want to run several regressions, one for each value of the Origin variable, which has the values "Asia," "Europe," and "USA." The following call to PROC SORT sorts the data by the Origin variable. The sorted data is stored in the CARS data set.

proc sort data=Sashelp.Cars out=Cars;
   by Origin;
run;

You can then specify Origin on the BY statement in PROC REG to carry out three regression analyses. When you run a BY-group analysis, you might not want to see all of the results displayed on the computer screen, especially if your goal is to save the results in an output data set. You can use the ODS EXCLUDE statement to suppress SAS output.

ods exclude all;                    /* suppress tables to screen */
ods output FitStatistics=Output2;   /* 'Output2' contains results for each BY group */
proc reg data=Cars plots=none;
   by Origin;
   model Horsepower = EngineSize Weight;
quit;
ods exclude none;                   /* no longer suppress tables */
 
proc print data=Output2 noobs;
   where Label2="R-Square";
   var Origin Label2 nValue2;
run;
odsoutput3

The output from PROC PRINT shows the R-square statistics for each model. Notice that the BY-group variables (in this case, Origin) are added to output data sets when you run a BY-group analysis. You can now use the statistics in programs or graphs.

Alternatives to using ODS OUTPUT

Some procedures provide an alternative option for creating an output data set that contains statistics. Always check the SAS documentation to see if the procedure provides an option that writes common statistics to an output data set. For example, the documentation for the PROC REG statement states that you can use the OUTEST= option with the RSQUARE option to obtain an output data set that contains the parameter estimates and other model statistics such as the R-square value. Thus for this example, you do not need to use the ODS OUTPUT statement to direct the FitStatistics table to a data set. Instead, you can obtain the statistic as follows:

proc reg data=Cars NOPRINT outest=Output3 RSQUARE; /* statistics in 'Output3' */
   by Origin;
   model Horsepower = EngineSize Weight;
quit;
 
proc print data=Output3 noobs;
   format _RSQ_ 8.6;
   var Origin _RSQ_;
run;
odsoutput4

Summary

In summary, the ODS OUTPUT statement enables you to create a data set that contains any statistic that is produced by a SAS procedure. You can use the ODS OUTPUT statement to capture a statistic and use it later in your program.

Post a Comment

Is "La Quinta" Spanish for "Next to Denny's"?

“La Quinta” is Spanish for “next to Denny’s.”
     -- Mitch Hedberg, comedian

Mitch Hedberg's joke resonates with travelers who drive on the US interstate system because many highway exits feature both a La Quinta Inn™ and a Denny's® restaurant within a short distance of each other. But does a statistical data analysis support this anecdotal evidence?

In 2014 John Reiser wrote a blog post that uses the Python language to scrape the web for the locations of La Quinta Inns and Denny's restaurants. He then analyzed the data to show that, yes, in general, a guest at a La Quinta Inn does not have far to travel if he wants to eat at a Denny's restaurant. This work inspired Colin Rundel and Mine Cetckaya-Rundel to assign this analysis as a project for their students at Duke University and to write an article in CHANCE magazine (29(2), 2016) about the assignment.

Rundel and Cetckaya-Rundel posted CSV files on the CHANCE web site that contain the longitude, latitude, and addresses of 851 La Quinta Inns and 1,634 Denny's restaurants in the contiguous US (as of Dec 2015). This article follows their presentation and shows how to analyze the La Quinta-Denny's spatial data in SAS. The analysis is a straightforward extension of the nearest-neighbor techniques shown in article "Distances between observations in two groups." You can download the SAS program that imports the data and creates all the graphs and tables in this article.

Visualizing the locations of La Quinta Inns and Denny's restaurants

Locations of La Quinta Inns and Denny's restaurants in the contiguous US

You can use PROC HTTP in SAS to read CSV files directly from a URL. After importing the locations of La Quinta Inns and Denny's restaurants from the CHANCE web site, you can use PROC SGPLOT to plot the (unprojected) locations as longitudes and latitudes. To help visualize the locations, the adjacent graph overlays the data on an outline of the lower-48 states in the US. (Click the graph to enlarge.)

In the graph, the La Quinta Inns are represented by blue circles and Denny's by a red cross. In the enlarged version you can see that many circles enclose a red cross, which indicates that the inn and restaurant are very close. On the other hand, there are a few La Quinta locations that seem to be far away from any Denny's. Montana, North Dakota, Louisiana, Kansas, Nevada, and southwest Texas are some geographic regions in which a blue circle is not close to a red cross.

The distance to the nearest Denny's for each La Quinta Inn

Imagine that a husband and wife are spending their retirement by crisscrossing the US. The wife wants to sleep each night at a La Quinta Inn. The husband wants to eat breakfast each morning at a Denny's restaurant. If they both get their wishes, how far will they need to travel to get breakfast each morning?

I've previously written about how to compute the distance to the nearest neighbor between observations that are in different groups. For these data, the La Quinta Inns form one group and the Denny's restaurants form a second group. Because the coordinates for these data are longitude and latitude, you need to use the GEODIST function in Base SAS to compute the distance ( in kilometers, as the crow flies) between each hotel and the nearest Denny's.

For each of the 851 hotels, you can find the distance to the nearest Denny's. The following table summarizes the distribution of distances, in kilometers:

lqdennys2

The table shows that the closest Denny 's is a mere 11 meters from the adjacent La Quinta Inn! For 25% of the inns, the nearest Denny's is within 1.3 km, which is a short walk. Fifty percent of the inns are within 5 km (an easy drive) of a Denny's, and 75% are within 17 km. It seems that the data supports a modified version of Hedberg's joke: “La Quinta” is Spanish for “often close to a Denny’s.”

The following histogram shows the distribution of the 851 distances from each La Quinta Inn to the nearest Denny's. The histogram shows that about 65% of the inns are within 10 km and about 77% are within 20 km. As long as the husband and wife stay at these inns, they will both be happy, well-rested, and well-fed!

lqdennys3

La Quinta Inns that are far from a Denny's

lqdennys4

Clearly the couple can happily sleep at a La Quinta Inn and eat at a Denny's provided that they avoid the few La Quinta Inns that are not located near a Denny's. The scatter plot and map near the top of this article gives some indications about where these inns are located, but we can identify these inns more precisely. For the sake of the couple's marital bliss, let's enumerate the inns that are farthest from a Denny's.

The adjacent table shows 10 La Quinta Inns that are farthest from a Denny's restaurant. The farthest distance is the La Quinta Inn in Glendive, Montana, which is 282 km from the nearest Denny's. La Quinta Inns in Kansas, Nevada, Texas, and Louisiana are also more than 150 km away from a Denny's.

If possible, the couple should avoid these inns, but what if their travels bring them through these cities? In that case, they need to know that distance and location to the nearest Denny's. The next graph might be helpful.

The following graph shows all the La Quinta Inns. The inns shown in red are those for which the distance to the nearest Denny's is more than 80 km away. For each of these inns, an arrow is drawn from the La Quinta Inn to the location of the nearest Denny's. (I explained this nearest-neighbor plot in a previous article.) This helps the couple know what direction they need to drive in order to reach breakfast—or perhaps lunch! For some La Quinta locations (MT, SC, TN) the couple will need to drive into an adjacent state.

Nearest Denny's for La Quinta Inns That Are Far from a Denny's

Summary

Mitch Hedberg's joke is funny because it has an element of truth. For most La Quinta Inns, the nearest Denny's restaurant is a short walk or drive away. By using nearest-neighbor computations and the GEODIST function in SAS, you can compute the distances from each inn to the nearest Denny's. You can graph the set of distances and compute statistics such as quantiles. You can visualize the co-locations of the inns and restaurants, and even direct travelers to the nearest Denny's. I agree with Rundel and Cetckaya-Rundel that this exercise provides a fun activity in data analysis for students.

For professional SAS programmers, the exercise demonstrates how to conduct a particular kind of co-location analysis. Instead of Denny's restaurants, the professional analyst might be interested in the distance to the nearest hospital, distribution center, or cell-phone tower. SAS statistical graphics and SAS/IML, provides the tools for analyzing the distance between groups of spatial data.

If you want to examine or extend my analysis of these data, you can download the SAS program.

Post a Comment

The top 10 posts from The DO Loop in 2016

I wrote 105 posts for The DO Loop blog in 2016. My most popular articles were about data analysis, SAS programming tips, and elementary statistics. Without further ado, here are the most popular articles from 2016.

Top 10 blog posts from The DO Loop in 2016 Click To Tweet

Data Analysis and Visualization

Ovarlay histograms

Start with a juicy set of data and an interesting question. Mix in some SAS data analysis and a colorful graph that visualizes the data and tells a story. What have you got? A popular blog post! The following posts generated buzz on social media. They also show off the power of SAS analytics and visualization.

General SAS programming techniques

selectwhen

Everyone who uses SAS needs to know tricks and techniques for programming in the DATA step or for working with macros. No wonder these articles were so popular!

Statistical Techniques

Moving average in SAS

If you browse SAS discussion forums, you will see many questions about computing moving statistics, creating dummy variables, and running weighted analyses. I wrote some articles about these topics that resonated with readers in 2016:

Did you miss any of these popular posts? Here's your chance to read (or re-read!) one of these top 10 posts from 2016.

Next week: My picks for articles that did not make this list, but deserve a second look.

Post a Comment

The contaminated normal distribution

How can you generate data that contains outliers in a simulation study? The contaminated normal distribution is a simple but useful distribution you can use to simulate outliers. The distribution is easy to explain and understand, and it is also easy to implement in SAS.

What is a contaminated normal distribution?

The contaminated normal distibution was originally studied by John Tukey in the 190s and '50s. As I say in my book Simulating Data with SAS (2013, p. 119), "the contaminated normal distribution is a specific instance of a two-component mixture distribution in which both components are normally distributed with a common mean.... This results in a distribution with heavier tails than normality. This is also a convenient way to generate data with outliers."

Specifically, a contaminated normal distribution is a mixture of two normal distributions with mixing probabilities (1 - α) and α, where typically 0 < α ≤ 0.1. You can write the density of a contaminated normal distribution in terms of the component densities. Let φ(x; μ, σ) denote the distribution of the normal distribution with mean μ and standard deviation σ. Then the contaminated normal density is
f(x) = (1 - α)φ(x; μ, σ) + α φ(x; μ, λσ)
where λ > 1 is a parameter that determines the standard deviation of the wider component.

The idea is that the "main" distribution (φ(x; μ, σ)) is slightly "contaminated" by a wider distribution. Tukey (1960) uses λ=3 as a scale multiplier. This article uses α = 0.1, which represents 10% "contamination." Tukey reports that when λ=3 and α=0.1, "the two constituents contribute equal amounts to the variance of the contaminated distribution." In the following sections, μ=0 and σ=1, so that the uncontaminated component is the standard normal distribution.

The density of the contaminated normal distribution

The following SAS DATA step constructs the density of a contaminated normal distribution as the linear combination of a N(0,1) and a N(0,3) density. The call to the SGPLOT procedure plots the density and the component densities:

%let alpha = 0.1;
%let lambda = 3;
data CNPDF;
label Y1="N(0,1)" Y2="N(0,3)";
do x = -3*&lambda to 3*&lambda by 0.1;
   Y1 = pdf("Normal", x, 0, 1);             /* std normal component */
   Y2 = pdf("Normal", x, 0, 1*&lambda);     /* contamination */
   CN = (1-&alpha)*Y1 + &alpha*Y2;          /* contaminated normal */
   output;
end;
run;
 
title "Contaminated Normal Distribution";
title2 "alpha = &alpha; lambda = &lambda";
proc sgplot data=CNPDF;
   label CN = "0.9 N(0,1) + 0.1 N(0,3)";
   series x=x y=Y1;
   series x=x y=Y2;
   series x=x y=CN / lineattrs=(thickness=3);
   xaxis grid values=(-9 to 9);
   yaxis grid;
run;
Contaminated normal density

As shown in the graph, the contaminated normal distribution (shown with a thick line) has heavier tails than the "uncontaminated" normal component.

Random samples from the contaminated normal distribution

The book Simulating Data with SAS (2013, p. 120) provides an algorithm for simulating data from a contaminated normal distribution. The algorithm is a special case of simulating from a mixture distribution. You iteratively choose a component with probability α and then generate a value from whichever component is chosen, as follows:

%let alpha= 0.1;               /* level of contamination */
%let lambda = 3;               /* magnitude of contamination */
%let N = 100;                  /* size of sample */
data CNRand(keep=x contaminate);
call streaminit(12345);
do i = 1 to &N;
   contaminate = rand("Bernoulli", &alpha);
   if contaminate then
      x = rand("Normal", 0, &lambda);
   else
      x = rand("Normal", 0, 1);
   output;
end;
run;
 
proc sgplot data=CNRand;
   histogram x;
   density x / type=normal(mu=0 sigma=1) name="normal";
   density x / type=kernel name="kernel";
   fringe x / group=contaminate  lineattrs=(thickness=2);
   keylegend "kernel" "normal" / location=inside position=topright across=1;
   yaxis offsetmin=0.035;
run;
Random sample from a contaminated normal distribution

The histogram shows the distribution of the simulated sample. A kernel density estimate is overlaid, as is the density for the uncontaminated N(0,1) component. A fringe plot (also called a "rug plot") is shown underneath the histogram so that you can see the actual values in the sample.

The data that are generated by the uncontaminated component are shown in one color; the data from the contaminated component are shown in a different color. Whereas the standard normal density rarely produces data values for which |x| > 3, the sample contains four values that exceed 3 in magnitude. As you can see, all four "extreme" values are generated from the contaminated component. However, the contaminated component also generates two values that are not so extreme.

CDF and quantiles for the contaminated normal distribution

You can easily compute the cumulative distribution (and therefore probabilities) of the contaminated normal (CN) distribution as a linear combination of the component CDFs. For example, if you want to know the probability that a random observation from the CN distribution exceeds 3 in magnitude, you can compute that probability as follows:

data Prob;
   x = 3;
   leftP  = (1-&alpha)*cdf("Normal", -x, 0, 1) +
                &alpha*cdf("Normal", -x, 0, &lambda);
   totalP = 2*leftP;     /* distribution is symmetric */
run;
proc print noobs;run;
CDF of contaminated normal distribution

The table shows that the probability is 0.034. Almost all of that probability comes from the contaminated component of the distribution.

The quantile function of the CN distribution does not have a closed-form solution. Given a probability P, the quantile of the CN is the value of x for which P = (1-α)CDF(x; μ, σ) + αCDF(x ; μ, λσ). You can use a root-finding method to find the quantile. For example, you can use the FROOT function in SAS/IML, as follows:

proc iml;
start CNQuantile(x) global (alpha, lambda, Prob);
   return Prob - ( (1-alpha)*CDF("Normal", x, 0, 1) + 
                      alpha *CDF("Normal", x, 0, lambda) );
finish;
 
Prob = 0.01708;
alpha = 0.1;
lambda = 3;
x = froot("CNQuantile", {-9 0});  /* find quantile for Prob */
print x;
Quantile of the contaminated normal distribution

The computation find that the quantile for 0.01708 is about -3.

Not all researchers agree that a contaminated normal distribution is an appropriate model for non-Gaussian data. Gleason (1993, JASA) provides an overview of the history of the CN distribution, discusses how the parameters contribute to the elongation of the tail, and compares the CN with other long-tailed distributions.

References

Post a Comment

Solve linear programming problems in SAS

In some applications, you need to optimize a linear objective function of many variables, subject to linear constraints. Solving this problem is called linear programming or linear optimization. This article shows two ways to solve linear programming problems in SAS: You can use the OPTMODEL procedure in SAS/OR software or use the LPSOLVE function in SAS/IML software.

Formulating the optimization

A linear programming problem can always be written in a standard vector form. In the standard form, the unknown variables are nonnegative, which is written in vector form as x0. Because the objective function is linear, it can be expressed as the inner product of a known vector of coefficients (c) with the unknown variable vector: cTx Because the constraints are also linear, the inequality constraints can always be written as Axb for a known constraint matrix A and a known vector of values b.

In practice, it is often convenient to be able to specify the problem in a non-standardized form in which some of the constraints represent "greater than," others represent "less than," and others represent equality. Computer software can translate the problem into a standardized form and solve it.

A linear programming problem example

As an example of how to solve a linear programming problem in SAS, let's pose a particular two-variable problem:

  • Let x = {x1, x2} be the vector of unknown positive variables.
  • Define the objective function to maximize as 3*x1 + 5*x2. In vector form, the objective function is cTx where c = {3, 5}.
  • Define the linear constraints as follows:
    3*x1 + -2*x2 ≤ 10
    5*x1 + 10*x2 ≤ 56
    4*x1 +  2*x2 ≥  7
    In matrix form, you can define the 3 x 2 matrix of constraint coefficients A = {3  -2, 5  10, 4  2} and the right-hand-side vector b = {10, 56, 7}. There is not a standard way to express a "vector of symbols," but we can invent one. Let LEG = {≤, ≤, ≥} be a vector of symbols. ("LEG" is a mnemonic for "Less than," "Equal to," or "Greater than.") With this notation, the vector of constraints is written as Ax LEG b.
Feasible region for linear program problem  in SAS. The optimal solution is marked.

The graph shows the polygonal region in the plane that satisfies the constraints for this problem. The region (called the feasible region) is defined by the two axes and the three linear inequalities. The color in the interior of the region indicates the value of the objective function within the feasible region. The green star indicates the optimal solution, which is x = {5.3, 2.95}.

The theory of linear programming says that an optimal solution will always be found at a vertex of the feasible region, which in 2-D is a polygon. In this problem, the feasible polygon has only five vertices, so you could evaluate the objective function at each vertex by hand to find the optimal solution. For high-dimensional problems, however, you definitely want to use a computer!

Linear programming in SAS by using the OPTMODEL procedure

The OPTMODEL procedure is part of SAS/OR software. It provides a natural programming language with which to define and solve all kinds of optimization problems. The linear programming example in this article is similar to the "Getting Started" example in the PROC OPTMODEL chapter about linear programming. The following statements use syntax that is remarkably similar to the mathematical formulation of the problem:

proc optmodel;
var x{i in 1..2} >= 0;           /* information about the variables */
max z = 3*x[1] +  5*x[2];        /* define the objective function */
con c1: 3*x[1] + -2*x[2] <= 10;  /* specify linear constraints */
con c2: 5*x[1] + 10*x[2] <= 56;
con c3: 4*x[1] +  2*x[2] >=  7;
solve;                           /* solve the LP problem */
print x;
Solution to linear programming problem with PROC OPTMODEL i nSAS/OR software

The OPTMODEL procedure prints two tables. The first (not shown) is a table that describes the algorithm that was used to solve the problem. The second table is the solution vector, which is x = {5.3, 2.95}.

For an introduction to using the OPTMODEL procedure to solve linear programming problems, see the 2011 paper by Rob Pratt and Ed Hughes.

Linear programming by using the LPSOLVE subroutine in SAS/IML

Not every SAS customer has a license for SAS/OR software, but hundreds of thousands of people have access to the SAS/IML matrix language. In addition to companies that license SAS/IML software, SAS/IML is part of the free SAS University Edition, which has been downloaded almost one million times by students, teachers, researchers, and self-learners.

Whereas the syntax in PROC OPTMODEL closely reflects the mathematical formulation, the SAS/IML language uses matrices and vectors to specify the problem. You then pass those matrices as arguments to the LPSOLVE subroutine. The subroutine returns the solution (and other information) in output arguments. The following list summarizes the information that you must provide:

  • Define the range of the variables: You can specify a vector for the lower bounds and/or upper bounds of the variables.
  • Define the objective function: Specify the vector of coefficients (c) such that c`*x is the linear objective function. Specify whether the goal is to minimize or maximize the objective function.
  • Specify the linear constraints: Specify the matrix of constraint coefficients (A), the right-hand side vector of values (b), and a vector (LEG) that indicates whether each constraint is an inequality or an equality. (The documentation calls the LEG vector the "rowsense" vector because it specifies the "sense" or direction of each row in the constraint matrix.)

The following SAS/IML program defines and solves the same LP problem as in the previous section. I've added plenty of comments so you can see how the elements in this program compare to the more compact representation in PROC OPTMODEL:

proc iml;
/* information about the variables */
LowerB = {0, 0};           /* lower bound constraints on x */
UpperB = {., .};           /* upper bound constraints on x */
 
/* define the objective function */
c = {3, 5};                /* vector for objective function c*x */
/* control vector for optimization */
ctrl = {-1,                /* maximize objective */
         1};               /* print some details */
 
/* specify linear constraints */
A = {3 -2,                 /* matrix of constraint coefficients */
     5 10,
     4  2};
b = {10,                   /* RHS of constraint eqns (column vector) */
     56,
      7};         
LEG = {L, L, G};           /* specify symbols for constraints:
                              'L' for less than or equal
                              'E' for equal
                              'G' for greater than or equal */
 
/* solve the LP problem */
CALL LPSOLVE(rc, objVal, result, dual, reducost, /* output variables */
             c, A, b,      /* objective and linear constraints */
             ctrl,         /* control vector */
             LEG, /*range*/ , LowerB, UpperB); 
print rc objVal, result[rowname={x1 x2}];
Solution to linear programming problem        by using the LPSOLVE function in SAS/IML software

In the call to LPSOLVE, the first five arguments are output arguments. The first three of these are printed:

  • If the call finds the optimal solution, then the return code (rc, first parameter) is 0.
  • The value of the objective function at the optimal value is returned in the second parameter (objVal).
  • The solution to the LP problem (the optimal value of the variables) is returned in the third argument.

Although the LPSOLVE function was not as simple to use as PROC OPTMODEL, it obtains the same solution. The LPSOLVE subroutine supports many features that are not mentioned here. For details, see the documentation for LPSOLVE.

The LPSOLVE subroutine was introduced in SAS/IML 13.1, which was shipped with SAS 9.4m1. The LPSOLVE function replaces the older LP subroutine, which is deprecated. SAS 9.3 customers can call the LP subroutine, which works similarly.

Post a Comment

Animate snowfall in SAS

Animated image of falling Koch snowflakes
Out of the bosom of the Air,
    Out of the cloud-folds of her garments shaken,
Over the woodlands brown and bare,
    Over the harvest-fields forsaken,
        Silent, and soft, and slow
        Descends the snow.

"Snow-flakes" by Henry Wadsworth Longfellow

Happy holidays to all my readers! In my last post I showed how to create a well-known fractal called the Koch snowflake. The snowflake is aptly named because it has six-fold symmetry. But as Longfellow noted, a real snowflake is not stationary, but descends "silent, and soft, and slow."

As a gift to my readers, I decided to create an animated greeting card entirely in SAS. The animated GIF (click to enlarge) uses some of the SAS techniques that I have blogged about in 2016. The falling and rotating snowflakes were created by using matrix computations in the SAS/IML language. The animated GIF was created by using ODS graphics and PROC SGPLOT.

Create an animated greeting card with #SAS Click To Tweet

Techniques used to create the animation

If you want to learn how I created this animated GIF with SAS, read on. I've blogged about all the essential techniques in previous posts:

You can download the SAS program that creates the greeting card. Let me know if you adapt this program to create other animated images.

If you like SAS statistical programming or want to learn more about it, subscribe to this blog. In most articles I show how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and more. But programming with SAS can also be fun, and sometimes it takes a less-serious application to make people say, "Wow! That's cool! I didn't know SAS could do that!"

Post a Comment

Create a Koch snowflake with SAS

Koch Snowflake in SAS

I have a fondness for fractals.

In previous articles, I've used SAS to create some of my favorite fractals, including a fractal Christmas tree and the "devil's staircase" (Cantor ) function. Because winter is almost here, I think it is time to construct the Koch snowflake fractal in SAS.

A Koch snowflake is shown to the right. The Koch snowflake has a simple construction. Begin with an equilateral triangle. For every line segment, replace the segment by four shorter segments, each of which is one-third the length of the original segment. The two middle segments are offset by 60 degrees from the direction of the original segment. You then iterate this process, with each step replacing every line segment with four others.

The Koch division process

One step in the construction of a Koch Snowflake

For simplicity, the graph at right shows the first step of the construction on a single line segment that has one endpoint at (0,0) and another endpoint at (1,0). The original horizontal line segment of length one is replaced by four line segments of length 1/3. The first and fourth segments are in the same direction as the original segment (horizontal). The second segment is rotated -60 degrees from the original direction. It, too, is 1/3 the original length. The third segment connects the second and fourth line segments.

The following program defines a function that uses the vector capabilities of SAS/IML software to carry out the fundamental step on a line segment. The function takes two row vectors as arguments, which are the (x,y) coordinates of the endpoints of the line segment. The function returns a 5 x 2 matrix, where the rows contain the endpoints of the four line segments that result from the Koch construction.

Two statements in the function might be unfamiliar. The direct product operator (@) quickly constructs points that are k/3 units along the original line segment for k=0, 1, 2, and 3. The ATAN2 function computes the direction angle for a two-dimensional vector.

proc iml;
/* divide a line segment (2 pts) into 4 segments (5 pts).  
   Create middle point by rotating through -60 degrees */ 
start KochDivide(A, E);                    /* (x,y) coords of endpoints */
   segs = j(5, 2, .);                      /* matrix to hold 4 shorter segs */
   v = (E-A) / 3;                          /* vector 1/3 as long as orig */
   segs[{1 2 4 5}, ] = A + v @ T(0:3);     /* endpoints of new segs */
   /* Now compute middle point. Use ATAN2 to find direction angle. */
   theta = -constant("pi")/3 + atan2(v[2], v[1]); /* change angle by pi/3 */
   w = cos(theta) || sin(theta);           /* vector to middle point */
   segs[3,] = segs[2,] + norm(v)*w;
   return segs;
finish;
 
/* test on line segment from (0,0) to (1,0) */
s = KochDivide({0 0}, {1 0});
title  "Fundamental Step in the Koch Snowflake Construction";
ods graphics / width=600  height=300; 
call series(s[,1], s[,2]) procopt="aspect=0.333";

Construct the Koch snowflake in SAS

Now that we have defined a function that can replace a segment by four shorter segments, let's define a function that applies that function repeatedly to each segment of a polygon. The following function takes two arguments: the polygon and the number of iterations. An N-sided closed polygon is represented as an (N+1) x 2 matrix of vertices.

/* create Koch Snowflake from and equilateral triangle */
start KochPoly(P0, iters=5);    
P = P0;
do j=1 to iters;
   N = nrow(P) - 1;             /* old number of segments */
   newP = j(4*N+1, 2);          /* new number of segments + 1 */
   do i=1 to N;                 /* for each segment... */
      idx = (4*i-3):(4*i+1);                    /* rows for 4 new segments */
      newP[idx, ] = KochDivide(P[i,], P[i+1,]); /* generate new segments */
   end;
   P = newP;                    /* update polygon and iterate */
end;
return P;
finish;

To test the function, define an equilateral triangle. The call to the KochPoly function creates a Koch snowflake from the equilateral triangle.

/* create equilateral triangle as base for snowflake */
pi = constant("pi");
angles = -pi/6 // pi/2 // 7*pi/6;  /* angles for equilateral triangle */
P = cos(angles) || sin(angles);    /* vertices of equilateral triangle */
P = P // P[1,];                    /* append first vertex to close triangle */
K = KochPoly(P);                   /* create the Koch snowflake */

The Koch snowflake that results from this iteration is shown at the top of this article. For completeness, here are the statements that write the snowflake to a SAS data set and graph it by using PROC SGPLOT:

S = j(nrow(K), 3, 1);      /* add ID variable with constant value 1 */
S[ ,1:2] = K;
create Koch from S[colname={X Y ID}];  append from S;  close;
QUIT;
 
ods graphics / width=400px height=400px;
footnote justify=Center "Koch Snowflake";
proc sgplot data=Koch;
   styleattrs wallcolor=CXD6EAF8;               /* light blue */
   polygon x=x y=y ID=ID / outline fill fillattrs=(color=white);
   xaxis display=none;
   yaxis display=none;
run;

Generalizations of the Koch snowflake

The KochPoly function does not check whether the polygon argument is an equilateral triangle. Consequently, use the function to "Koch-ify" any polygon! For example pass in the rectangle with vertices P = {0 0, 2 0, 2 1, 0 1, 0 0}; to create a "Koch box."

Enjoy playing with the program. Let me know if you generate an interesting shape!

Post a Comment

Simultaneous confidence intervals for a multivariate mean

Many SAS procedure compute statistics and also compute confidence intervals for the associated parameters. For example, PROC MEANS can compute the estimate of a univariate mean, and you can use the CLM option to get a confidence interval for the population mean. Many parametric regression procedures (such as PROC GLM) can compute confidence intervals for regression parameters. There are many other examples.

If an analysis provides confidence intervals (interval estimates) for multiple parameters, the coverage probabilities apply individually for each parameter. However, sometimes it is useful to construct simultaneous confidence intervals. These are wider intervals for which you can claim that all parameters are in the intervals simultaneously with confidence level 1-α.

This article shows how to use SAS to construct a set of simultaneous confidence intervals for the population mean. The middle of this article uses some advanced multivariate statistics. If you only want to see the final SAS code, jump to the last section of this article.

Compute simultaneous confidence intervals for the mean in #SAS. #Statistics Click To Tweet

The distribution of the mean vector

If the data are a random sample from a multivariate normal population, it is well known (see Johnson and Wichern, Applied Multivariate Statistical Analysis, 1992, p. 149; hereafter abbreviated J&W) that the distribution of the sample mean vector is also multivariate normal. There is a multivariate version of the central limit theorem (J&W, p. 152) that says that the mean vector is approximately normally distributed for random samples from any population, provided that the sample size is large enough. This fact can be used to construct simultaneous confidence intervals for the mean.

Recall that the most natural confidence region for a multivariate mean is a confidence ellipse. However, simultaneous confidence intervals are more useful in practice.

Confidence intervals for the mean vector

Before looking at multivariate confidence intervals (CI), recall that many a univariate two-sided CIs are symmetric intervals with endpoints b ± m*SE, where b is the value of the statistic, m is some multiplier, and SE is the standard error of the statistic. The multiplier must be chosen so that the interval has the appropriate coverage probability. For example, the two-sided confidence interval for the univariate mean is has the familiar formula xbar ± tc SE, where xbar is the sample mean, tc is the critical value of the t statistic with significance level α and n-1 degrees of freedom, and SE is the standard error of the mean. In SAS, you can compute tc as quantile("t", 1-alpha/2, n-1).

You can construct similar confidence intervals for the multivariate mean vector. I will show two of the approaches in Johnson and Wichern.

Hotelling T-squared statistic

As shown in the SAS documentation, the radii for the multivariate confidence ellipse for the mean are determined by critical values of an F statistic. The Hotelling T-squared statistic is a scaled version of an F statistic and is used to describe the distribution of the multivariate sample mean.

The following SAS/IML program computes the T-squared statistic for a four-dimensional sample. The Sashelp.iris data contains measurements of the size of petals and sepals for iris flowers. This subset of the data contains 50 observations for the species iris Virginica. (If you don't have SAS/IML software, you can compute the means and standard errors by using PROC MEANS, write them to a SAS data set, and use a DATA step to compute the confidence intervals.)

proc iml;
use sashelp.iris where(species="Virginica"); /* read data */
read all var _NUM_ into X[colname=varNames];
close;
 
n = nrow(X);               /* num obs (assume no missing) */
k = ncol(X);               /* num variables */
alpha = 0.05;              /* significance level */
xbar = mean(X);            /* mean of sample */
stderr = std(X) / sqrt(n); /* standard error of the mean */
 
/* Use T-squared to find simultaneous CIs for mean parameters */
F = quantile("F", 1-alpha, k, n-k);  /* critical value of F(k, n-k) */
T2 = k*(n-1)/(n-k) # F;              /* Hotelling's T-squared is scaled F */
m = sqrt( T2 );                      /* multiplier */
Lower = xbar - m # stdErr;
Upper = xbar + m # stdErr;
T2_CI = (xbar`) || (Lower`) || (Upper`);
print T2_CI[F=8.4 C={"Estimate" "Lower" "Upper"}  R=varNames];
t_bonferroni1

The table shows confidence intervals based on the T-squared statistic. The formula for the multiplier is a k-dimensional version of the 2-dimensional formula that is used to compute confidence ellipses for the mean.

Bonferroni-adjusted confidence intervals

It turns out that the T-squared CIs are conservative, which means that they are wider than they need to be. You can obtain a narrower confidence interval by using a Bonferroni correction to the univariate CI.

The Bonferroni correction is easy to understand. Suppose that you have k MVN mean parameters that you want to cover simultaneously. You can do it by choosing the significance level of each univariate CI to be α/k. Why? Because then the joint probability of all the parameters being covered (assuming independence) will be (1 - α/k)k, and by Taylor's theorem (1 - α/k)k ≈ 1 - α when (α/k) is very small. (I've left out many details! See J&W p. 196-199 for the full story.)

In other words, an easy way to construct simultaneous confidence intervals for the mean is to compute the usual two-sided CIs for significance level α/k, as follows:

/* Bonferroni adjustment of t statistic when there are k parameters */
tBonf = quantile("T", 1-alpha/(2*k), n-1);  /* adjusted critical value */
Lower = xbar - tBonf # stdErr;              
Upper = xbar + tBonf # stdErr;
Bonf_CI = (xbar`) || (Lower`) || (Upper`);
print Bonf_CI[F=8.4 C={"Estimate" "Lower" "Upper"} R=varNames];
t_bonferroni1

Notice that the confidence intervals for the Bonferroni method are narrower than for the T-square method (J&W, p. 199).

The following graph shows a scatter plot of two of the four variables. The sample mean is marked by an X. For reference, the graph includes a bivariate confidence ellipse. The T-squared confidence intervals are shown in blue. The thinner Bonferroni confidence intervals are shown in red.

Bonferroni and T-squared simultaneous confidence intervals for the mean of four-dimensional iris data

Compute simultaneous confidence intervals for the mean in SAS

The previous sections have shown that the Bonferroni method is an easy way to form simultaneous confidence intervals (CIs) for the mean of multivariate data. If you want the overall coverage probability to be at most (1 - α), you can construct k univariate CIs, each with significance level α/k.

You can use the following call to PROC MEANS to construct simultaneous confidence intervals for the multivariate mean. The ALPHA= method enables you to specify the significance level. The method assumes that the data are all nonmissing. If your data contains missing values, use listwise deletion to remove them before computing the simultaneous CIs.

/* Bonferroni simultaneous CIs. For k variables, specify alpha/k 
   on the ALPHA= option. The data should c ontain no missing values. */
proc means data=sashelp.iris(where=(species="Virginica")) nolabels
     alpha=%sysevalf(0.05/4)  /* use alpha/k, where k is number of variables */
     mean clm maxdec=4;
var SepalLength SepalWidth PetalLength PetalWidth;  /* k = 4 */
run;
t_bonferroni3

The values in the table are identical to the Bonferroni-adjusted CIs that were computed earlier. The values in the third and fourth columns of the table define a four-dimensional rectangular region. For 95% of the random samples drawn from the population of iris Virginica flowers, the population means will be contained in the regions that are computed in this way.

Post a Comment