Packages: A new way to share SAS/IML programs

My previous post highlighted presentations at SAS Global Forum 2016 that heavily used SAS/IML software. Several of the authors clearly want to share their work with the wider SAS analytical community. They include their SAS/IML program in an appendix or mention a web site or email address from which the reader can obtain the SAS/IML functions that implement the analysis.

As of SAS/IML 14.1, there is an easy way to share SAS/IML functions with others. The IML language now supports packages through the PACKAGE statement. A package is a collection of files that contains source code, documentation, example programs, and sample data. Packages are distributed as ZIP files.

This article shows how to install and use a package. A subsequent article will discuss how to author a package. For more information, see my 2016 SAS Global Forum paper, "Writing Packages: A New Way to Distribute and Use SAS/IML Programs".

The main steps to use a package are as follows:

  1. Download the ZIP file that contains a package.
  2. To install the package, use the PACKAGE INSTALL statement from within PROC IML or SAS/IML Studio.
  3. To learn how to use the package, read the package documentation and run some of the sample programs that the package provides.
  4. To use the functions in the package, use the PACKAGE LOAD statement to load the functions into your current SAS/IML session.
  5. To access sample data that the package provides, use the PACKAGE LIBNAME statement.

In the following sections, each step is illustrated by using the polygon package, which contains functions that compute geometric properties of planar polygons. You can download the package from the SAS/IML File Exchange.

Obtain the package

The SAS/IML File Exchange is the recommended location to obtain packages. However, some authors might choose to post packages on GitHub or on their own web site.

Download the polygon package and save it to a location that SAS can access. This article assumes that the ZIP file is saved to the location C:\Packages\polygon.zip on a Windows PC.

Install the package

To install the package, run the following statements in SAS/IML 14.1 or later:

proc iml;
package install "C:\Packages\polygon.zip"; 
quit;

You only need to install a package one time. You can then load and use it in future PROC IML sessions. The package remains installed even after you exit from SAS or reboot your computer.

Read the documentation

The help directory in the ZIP file contains the full documentation for the package, often in the form of a PDF file. You can read that documentation by using a PDF reader outside of SAS. The PDF documentation might contain diagrams, equations, graphs, and other non-text information.

Authors should also provide a brief text-only version of the syntax. While running SAS, you can display the text-only version in the SAS log, as follows:

proc iml;
package help polygon;

For the polygon package, the SAS log contains the following overview of the package and the syntax for every public module.

Polygon Package
Description: Computational geometry for polygons
 
A polygon is represented as an n x 2 matrix, where the rows represent
adjacent vertices for the polygon. The polygon should be "open,"
which means that the last row does not equal the first row.
<...More text...>
 
Module Syntax:
PolyArea(P);
   returns the areas of simple polygons.
 
<...documentation for nine other functions...>
 
PolyWindingNumber(P, R);
   returns the winding number of the polygon P around the point R.

The purpose of the PACKAGE HELP statement is to remind you of the syntax of the functions in the package without leaving SAS.

Another way to learn to use a package is to examine and run the sample programs that the package author provides. For the polygon package, you can look in the programs directory. The file Example.sas demonstrates calling functions in the package.
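
For example, the following sketch runs the example program after you extract the ZIP file. The path below is hypothetical; substitute the location to which you extracted the file:

%include "C:\Packages\polygon\programs\Example.sas";   /* hypothetical path to the extracted example program */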

Load the functions

To read the functions in the package into the current IML session, use the PACKAGE LOAD statement, as follows:

package load polygon;

The SAS log will display a NOTE for each module that is loaded:

NOTE: Module _POLYAREA defined.
NOTE: Module POLYAREA defined.
<...More text...>
NOTE: Module POLYSTACK defined.

Notice that all functions begin with a common prefix, in this case "POLY" (or "_POLY" for internal-only functions). This is a good programming practice because it reduces the likelihood that functions in one package conflict with functions in another package. For example, if you load two packages that each have a function named "COMPUTE" (or an equally generic name), then there will be a conflict. By using "POLY" as a prefix, there is less likely to be a name collision.

Call the functions

After the functions are loaded, you can call them from your program. For example, the following statements define the four vertices of a rectangle and then call functions from the package to compute the perimeter and area of the polygon and to determine whether it is convex.

P = {0 0, 1 0, 1 2, 0 2};         /* vertices of rectangle */
Perimeter = PolyPerimeter(P);
Area      = PolyArea(P);
IsConvex  = PolyIsConvex(P);
print Perimeter Area IsConvex;

Use sample data

Some packages include sample data sets. You can use the PACKAGE LIBNAME statement to create a SAS libref that points to the data subdirectory of an installed package. The following statements read in sample data for polygons that are defined in the Simple data set. After the vertices of the polygons are read, the PolyDraw subroutine is called to visualize the polygons in the data set.

package libname PolyData polygon;    /* define libref */
use PolyData.Simple;                 /* sample data in package */
read all var {u v ID} into P;        /* vertices; 3rd col is ID variable */
close PolyData.Simple;
run PolyDraw(P);                     /* visualize the polygons */

Display package information

You can use the PACKAGE LIST statement to print the name and version of all installed packages. You can use the PACKAGE INFO command to obtain details about a specific package. The output for these statements is not shown.

package list;                /* list all installed packages */ 
package info polygon;        /* details about polygon package */

Summary

Packages were introduced in SAS/IML 14.1, which was released with SAS 9.4m3 in July 2015. This article briefly described how to download, install, load, and use a SAS/IML package that was written by someone else.

For additional details about packages, see Wicklin (2016). The official documentation of packages is contained in the "Packages" chapter in the SAS/IML User's Guide.


Matrix computations at SAS Global Forum 2016

Last week I attended SAS Global Forum 2016 in Las Vegas. I and more than 5,000 other attendees discussed and shared tips about data analysis and statistics.

Naturally, I attended many presentations that featured using SAS/IML software to implement advanced analytical algorithms. Several speakers showed impressive mastery of SAS/IML programming techniques. In all, almost 30 papers stated that they used SAS/IML for some or all of the computations.

The following presentations were especially impressive. As you can see, topics include Bayesian analysis, signal processing, spatial analysis, simulation, and logistic regression.

  • Jason Bentley, a PhD student at The University of Sydney, presented a paper about Bayesian inference for complex hierarchical models with smoothing splines (joint work with Cathy Lee). Whereas Markov-chain Monte Carlo techniques are "computationally intensive, or even infeasible" for large hierarchical models, Bentley coded a "fast deterministic alternative to MCMC." His SAS/IML code runs in 22 seconds on a hierarchical model that includes 350 schools and approximately 88,000 students. Jason says that this was his first serious SAS/IML program, but that he found the language "ideal" for implementing the algorithm.
  • Woranat Wongdhamma at Oklahoma State University used wavelet analysis in SAS/IML to construct a predictive model that helps clinical researchers diagnose obstructive sleep apnea. I did not realize that 1 in 15 adult Americans have moderate or severe sleep apnea, and many of these people are undiagnosed. Hopefully Wongdhamma's analysis will help expedite the diagnosis process.
  • Alan Ricardo da Silva, a professor at the University of Brasilia, has written many papers that feature SAS/IML programs. This year (with T. Rodrigues), Alan showed how to implement an algorithm for geographically weighted negative binomial regression. He applied this spatial analysis method to freight transportation in Brazil. The image at right is taken from his paper.
  • A second paper by da Silva (with co-author Paulo da Silva) shows how to use SAS/IML to simulate data from a skew normal and a skew t distribution.
  • Gongwei Chen used SAS/IML to build a discrete-time Markov chain model to forecast workloads at the Washington State Department of Corrections. Social workers and parole officers supervise convicted offenders who have been released from prison or were sentenced to supervision by the court system. Individuals who exhibit good behavior can transition from a highly supervised situation into less supervision. Other individuals might commit a new offense that requires an increase in supervision. Chen's probabilistic model helps the State of Washington forecast the resources needed for supervising offenders.
  • Katherine Cai and Jeffrey Wilson at Arizona State University used SAS/IML to build a logistic model of longitudinal data with time-dependent covariates. The algorithm was applied to obesity data and to children’s health data in the Philippines.
  • Another paper by Wilson (co-authored with Kyle Irimata) used SAS/IML to implement exact logistic regression for nested binary data.

The SAS/IML language was used in many other e-posters and presentations that, unfortunately, I could not attend due to other commitments. All of the papers for the proceedings are available online, so you can search the proceedings for "IML" or any other topic that interests you.

Experienced SAS programmers recognize that when an analysis is not available in any SAS procedure, the SAS/IML language is often the best choice for implementing the analysis. These papers and my discussions with SAS customers indicate that the SAS/IML language is actively used for advanced analytical models that require simulation, optimization, and matrix computations. Because SAS/IML is part of the free SAS University Edition, which has been downloaded 500,000 times, I expect the number of SAS/IML users to continue to grow.


Visualize missing data in SAS

You can visualize missing data. It sounds like an oxymoron, but it is true.

How can you draw graphs of something that is missing? In a previous article, I showed how you can use PROC MI in SAS/STAT software to create a table that shows patterns of missing data in SAS. In addition to tables, you can use graphics to visualize patterns of missing data.

Counts of observations that contain missing values

As shown in the previous post, it is useful to be able to count missing values across rows as well as down columns. These row and column operations are easily accomplished in a matrix language such as SAS/IML. For example, the following SAS/IML statements read data from the Sashelp.Heart data set. A single call to the COUNTMISS function counts the number of missing values in each row. The BAR subroutine creates a bar chart of the results:

proc iml;
varNames = {AgeAtStart Height Weight Diastolic 
            Systolic MRW Smoking Cholesterol};
use Sashelp.Heart;                         /* open data set */
read all var varNames into X;              /* create numeric data matrix, X */
close Sashelp.Heart;
 
title "Counts of Rows That Contain 0, 1, 2, or 3 Missing Values";
title2 "Sashelp.Heart Data";
Count = countmiss(X,"row");            /* count missing values for each obs */
call Bar(Count);                       /* create bar chart */
[Figure: Count of observations that contain missing values]

The bar chart clearly shows that most observations do not contain any missing values among the specified variables. A small percentage of observations contain one missing value. Even fewer contain two or three missing values.

Which observations contain missing values?

It can be useful to visualize the locations of observations that contain missing values. Are missing values spread uniformly at random throughout the data? Or do missing values appear in clumps, which might indicate a systematic issue with the data collection?

The previous section computed the Count variable, which is a vector that has the same number of elements as there are rows in the data. To visualize the locations (rows) of missing values, use the LOC function to find the rows that have at least one missing value and plot the row indices:

missRows = loc(Count > 0);                  /* which rows are missing? */
title "Location of Rows That Contain Missing Values";
call histogram( missRows ) scale="count"    /* plot distribution */
     rebin={125, 250}                       /* bin width = 250 */
     label="Row Number";                    /* label for X axis */
[Figure: Location of rows that contain missing values]

This histogram is very revealing. The bin width is 250, which means that each bar includes 250 observations. For the first 2000 observations, about 15 of every 250 observations contain a missing value. Then there is a series of about 500 observations that do not contain any missing values. For the remaining observations, only about three of every 250 observations contain missing values. It appears that the prevalence of missing values changed after the first 2000 observations. Perhaps the patients in this data set were entered according to the date on which they entered the program. Perhaps after the first 2000 patients were recruited, there was a change in data collection that resulted in many fewer missing values.

A heat map of the missing values

The ultimate visualization of missing data is to use a heat map to plot the entire data matrix. You can use one color (such as white) to represent nonmissing elements and another color (such as black) to represent missing values.

For many data sets (such as Sashelp.Heart), missing observations represent a small percentage of all rows. For data like that, the heat map is primarily white. Therefore, to save memory and computing time, it makes sense to visualize only the rows that have missing values. In the Sashelp.Heart data (which has 5209 rows), only 170 rows have missing values. For each of the 170 rows, you can plot which variables are missing.

The following statements implement this visualization of missing data in SAS. The matrix Y contains only the rows of the data matrix for which there is at least one missing value. You can call the CMISS function in Base SAS to obtain a binary matrix with values 0/1. The HEATMAPDISC subroutine in SAS/IML enables you to create a heat map of a matrix that has discrete values.

ods graphics / width=400px height=600px;
Y = X[missRows,];              /* extract missing rows   */
call HeatmapDisc( cmiss(Y) )   /* CMISS returns 0/1 matrix */
     displayoutlines=0 
     colorramp={white black}
     xvalues=VarNames          /* variable names along bottom */
     yvalues=missRows          /* use nonmissing rows numbers as labels */
     showlegend=0 title="Missing Value Pattern";
[Figure: Heat map that shows the missing value pattern]

The black line segments represent missing values. This heat map summarizes almost all of the information about missing values in the data. You can see which variables have no missing values, which have a few, and which have many. You can see that the MRW variable is always missing when the Weight variable is missing. You can see that the first 2250 observations have relatively many missing values, that there is a set of observations that are complete, and that the missing values occur less frequently for the last 2000 rows.

If you do not have SAS/IML software, you can still create a discrete heat map by using the Graph Template Language (GTL).
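
For example, the following sketch restructures the data into a long format and renders it with the HEATMAPPARM statement in GTL. The data set, variable, and template names are illustrative, and the cell colors come from the current ODS style rather than from a white/black color ramp:

data MissLong;
set Sashelp.Heart(keep=AgeAtStart Height Weight Diastolic
                       Systolic MRW Smoking Cholesterol);
Row = _N_;
array nums[*] AgeAtStart Height Weight Diastolic Systolic MRW Smoking Cholesterol;
length VarName $ 32;
if nmiss(of nums[*]) > 0 then do i = 1 to dim(nums);  /* keep only rows that have missing values */
   VarName = vname(nums[i]);                          /* variable name for this cell */
   IsMissing = missing(nums[i]);                      /* 1 if missing, 0 otherwise */
   output;
end;
keep Row VarName IsMissing;
run;
 
proc template;                          /* discrete heat map defined in GTL */
define statgraph MissPattern;
begingraph;
   entrytitle "Missing Value Pattern";
   layout overlay;
      heatmapparm x=VarName y=Row colorgroup=IsMissing;
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=MissLong template=MissPattern;
run;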

Be aware that this heat map visualization is limited to data for which the number of rows that have missing values is somewhat small (about 1,000 rows or fewer). In general, a heat map should contain at least one vertical pixel for each data row that you want to visualize. For the Sashelp.Heart data, there are 170 rows that have missing data, and the heat map in this example has 600 vertical pixels. If you have thousands of rows and try to construct a heat map that has only hundreds of pixels, then multiple data rows must be mapped onto a single row of pixels. The result is not well defined and is certainly not optimal.


Do you ever visualize missing data in SAS? What techniques do you use? Has visualization revealed something about your data that you hadn't known before? Leave a comment.


Examine patterns of missing data in SAS

Missing data can be informative. Sometimes missing values in one variable are related to missing values in another variable. Other times missing values in one variable are independent of missing values in other variables. As part of the exploratory phase of data analysis, you should investigate whether there are patterns in the missing data.

Counting is the simplest analysis of any data (missing or otherwise). For each variable, how many observations are missing? You can then proceed to more complex analyses. Do two variables share a similar missing value pattern? Can you predict that one variable will be missing if you know that another variable is missing?

This article shows a simple way to examine patterns of missing values in SAS. The example data is the Sashelp.Heart data set, which contains information about patients in the Framingham Heart Study. The meaning of most variables is evident from the variable's name. An exception is the MRW variable, which contains a patient's "Metropolitan Relative Weight." The MRW expresses the patient's weight as a percentage of an ideal weight. Thus an MRW score of 100 means "ideal weight" whereas a score of 150 means "50% heavier than ideal." The MRW is similar to the more familiar body-mass index (BMI).

The number of missing values for each variable

I have previously shown how to use PROC FREQ to count the number of missing values for numeric and character variables. The technique uses a custom format to classify each value as "Missing" or "Not Missing."
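
A minimal sketch of that technique looks like the following; the format names are arbitrary, and the KEEP= list restricts the output to a few variables:

proc format;
value NMissFmt   .     = "Missing"
                 other = "Not Missing";
value $NMissFmt  " "   = "Missing"
                 other = "Not Missing";
run;
 
proc freq data=Sashelp.Heart(keep=Height Weight Smoking Cholesterol Sex);
format _numeric_ NMissFmt. _character_ $NMissFmt.;
tables _all_ / missing;        /* include missing values in the frequency counts */
run;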

Alternatively, for numeric variables you can use PROC MEANS to count the number of missing values. PROC MEANS creates a compact easy-to-read table that summarizes the number of missing values for each numerical variable. The following statements use the N and NMISS options in the PROC MEANS statement to count the number of missing values in eight numerical variables in the Sashelp.Heart data set:

/* count missing values for numeric variables */
proc means data=Sashelp.Heart nolabel N NMISS;
var AgeAtStart Diastolic Systolic Height Weight MRW 
    Smoking Cholesterol;
run;

The NMISS column in the table shows the number of missing values for each variable. There are 5209 observations in the data set. Three variables have zero missing values, and another three have six missing values. The Smoking variable has 36 missing values whereas the Cholesterol variable has 152 missing values.

These univariate counts are helpful, but they do not tell you whether missing values for different variables are related. For example, there are six missing values for the Height, Weight, and MRW variables. How many patients contributed to those six missing values? Ten? Twelve? Perhaps there are only six patients, each with missing values for all three variables? If a patient has a missing Height, does that imply that Weight or MRW is missing also? To answer these questions you must look at the pattern of missing values.

Patterns of missing values

The MI procedure in SAS/STAT software is used for multiple imputation of missing values. PROC MI has an option to produce a table that summarizes the patterns of missing values among the observations. The following call to PROC MI uses the NIMPUTE=0 option to create the "Missing Data Patterns" table for the specified variables:

ods select MissPattern;
proc mi data=Sashelp.Heart nimpute=0;
var AgeAtStart Height Weight Diastolic 
    Systolic MRW Smoking Cholesterol;
run;
[Output: Missing Data Patterns table from PROC MI]

The table reports the number of observations that have a common pattern of missing values. In addition to counts, the table reports mean values for each group. Because the table is very wide, I have truncated some of the mean values.

The first row counts the number of complete observations. There are 5039 observations that have no missing values.

Subsequent rows report the number of missing values for variables, pairs of variables, triplets of variables, and so forth. The variables are analyzed from right to left. Thus the second row shows that 124 observations have missing values for only the rightmost variable, which is Cholesterol. The third row shows that eight observations have missing values for only the Smoking variable. The fourth row shows that 28 observations have missing values for both Smoking and Cholesterol.

The table continues by analyzing the remaining variables that have missing values, which are Height, Weight, and MRW. You can see that MRW is never missing by itself, but that it is always missing when Weight is missing. Height is missing by itself in four cases, and is missing simultaneously with Weight in two cases.

Notice that each row of the table represents a disjoint set of observations. Consequently, we can easily answer the previous questions about how many patients contribute to the missing values in Height and Weight. There are 10 patients: four are missing only the Height measurement, four are missing only the Weight measurement (which forces MRW to be missing), and two are missing both Height and Weight.

This preliminary analysis has provided important information about the distribution of missing data in the Sashelp.Heart data set. Most patients (96.7%) have complete data. The most likely measurement to be missing is Cholesterol, followed by information about whether the patient smokes. You can see exactly how many patients are missing Height measurements, Weight measurements, or both. It is obvious that the MRW variable has a missing value if and only if the Weight variable is missing.

The "Missing Data Patterns" table from PROC MI provides a useful summary of missing values for each combination of variables. Examining patterns of missing values can lead to insight into the data collection process, and is also the first step prior to modeling missing data by using multiple imputation. In my next post, I will show how to use basic graphics to visualize patterns of missing data.


Head-tail versus head-head: A counterintuitive property of coin tosses

I saw an interesting mathematical result in Wired magazine. The original article was about mathematical research into prime numbers, but the article included the following tantalizing fact:

If Alice tosses a [fair] coin until she sees a head followed by a tail, and Bob tosses a coin until he sees two heads in a row, then on average, Alice will require four tosses while Bob will require six tosses (try this at home!), even though head-tail and head-head have an equal chance of appearing after two coin tosses.

Yes, I would classify that result as counter-intuitive! When I read this paragraph, I knew I had to simulate these coin tosses in SAS.

Why does head-tail happen sooner (on average)?

Define the "game" for Alice as a sequence of coin tosses that terminates in a head followed by a tail. For Bob, the game is a sequence of coin tosses that terminates in two consecutive heads.

You can see evidence that the result is true by looking at the possible sequence of tosses that end the game for Alice versus Bob. First consider the sequences for Alice that terminate with head-tail (HT). The following six sequences terminate in four or fewer tosses:

  HT
 THT
 HHT
THHT
TTHT
HHHT

In contrast, among Bob's sequences that terminate with head-head (HH), only four require four or fewer tosses:

  HH
 THH
TTHH
HTHH

For Bob to "win" (that is, observe the sequence HH), he must first toss a head. On the next toss, if he obtains a head again, he wins. But if he observes a tail, he is must start over: He has to toss until a head appears, and the process continues.

In contrast, for Alice to "win" (that is, observe the sequence HT), she must first toss a head. If the next toss shows a tail, she wins. But if it is a head, she doesn't start over. In fact, she is already halfway to winning on the subsequent toss.

This result is counterintuitive because a sequence is just as likely to terminate with HH as with HT. So Alice and Bob "win" equally often if they are observing flips of a single fair coin. However, the sequences for which Alice wins tend to be shorter than those for which Bob wins.
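
Before running a simulation, you can compute the expected game lengths exactly by writing down the conditional expectations for each game state and solving the resulting linear equations. The following SAS/IML statements are a minimal sketch of that calculation:

proc iml;
/* Bob (HH):   E0 = 1 + 0.5*E1 + 0.5*E0   and   E1 = 1 + 0.5*0 + 0.5*E0 */
A_HH = {0.5 -0.5,
       -0.5  1.0};              /* coefficients of (E0, E1) */
b    = {1, 1};
E_HH = solve(A_HH, b);          /* solution is E0 = 6 */
/* Alice (HT): E0 = 1 + 0.5*E1 + 0.5*E0   and   E1 = 1 + 0.5*0 + 0.5*E1 */
A_HT = {0.5 -0.5,
        0.0  0.5};
E_HT = solve(A_HT, b);          /* solution is E0 = 4 */
print (E_HH[1])[L="Expected tosses for HH"] (E_HT[1])[L="Expected tosses for HT"];
quit;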

A simulation of coin tosses

The following SAS DATA step simulates Alice and Bob tossing coins until they have each won 100,000 games. For each player, the program counts how many tosses are needed before the "winning" sequence occurs. A random Bernoulli number (0 or 1 with probability 0.5) simulates the coin toss.

data CoinFlip(keep= player toss);
call streaminit(123456789);
label Player = "Stopping Condition";
do i = 1 to 100000;
   do player = "HT", "HH";               /* first Alice tosses; then Bob */
      first = rand("Bernoulli", 0.5);    /* 0 is Tail; 1 is Head */
      toss = 1;  done = 0;
      do until (done);
         second = rand("Bernoulli", 0.5);
         toss = toss + 1;                /* how many tosses until done? */
         if player = "HT" then           /* if Alice, ...        */
            done = (first=1 & second=0); /* stop when HT appears */
         else                            /* if Bob, ...          */
            done = (first=1 & second=1); /* stop when HH appears */
         first = second;                 /* remember for next toss */
      end;
      output;
   end;
end;
run;
 
proc means data=CoinFlip MAXDEC=1 mean median Q3 P90 max;
   class player;
   var toss;
run;
[Output: Sample statistics for the number of tosses required to end a simulated coin game]

The MEANS procedure displays statistics about the length of each game. The mean (expected) length of the game for Bob (HH) is 6 tosses. 75% of games were resolved in eight tosses or fewer and 90% required 12 or fewer. In the 100,000 simulated games, the longest game that Bob played required 61 tosses before it terminated. (The maximum game length varies with the random number seed.)

In contrast, Alice resolved her 100,000 games by using about one-third fewer tosses. The mean (expected) length of the game for Alice (HT) is 4 tosses. 75% of the games required five or fewer tosses and 90% were completed in seven or fewer tosses. In the 100,000 simulated games, the longest game that Alice played required 22 tosses.

Distribution of tosses for two games with different winning criteria

You can create a comparative histogram to compare the distribution of the game lengths. The distribution of the length of Bob's games is shown in the upper histogram. It has a long tail.

In contrast, the distribution of Alice's game lengths is shown in the bottom of the panel. The distribution drops off much more quickly than for Bob's games.

It is also enlightening to see the empirical cumulative distribution functions (ECDFs) for the number of tosses for each player. You can download a SAS program that creates the comparative histogram and the ECDFs. For skeptics, the program also contains a simulation that demonstrates that a sequence terminates with HH as often as it terminates with HT.

The result is not deep, but it reminds us that human intuition gets confused by conditional probability. Like the classic Monty Hall problem, simulation can convince us that a result is true, even when our intuition refuses to believe it.


Set attributes of markers in PROC SGPLOT by using ODS style elements

The SG procedures in SAS use aesthetically pleasing default colors, shapes, and styles, but sometimes it is necessary to override the default attributes. The MARKERATTRS= option enables you to override the default colors, symbols, and sizes of markers in scatter plots and other graphs. Similarly, the LINEATTRS= option enables you to override the default line colors, thickness, and pattern (solid, dashed, dotted, and so forth).

Many programmers specify hard-coded values such as MARKERATTRS=(COLOR=red SYMBOL=CircleFilled). However, sometimes it is important for colors and symbols to match the style elements in the current ODS style. For example, you might want to match the colors and symbols that appear in some other graph.

An example is shown below. The data set is the Sashelp.Cars data, which contains information about cars and trucks. The first graph displays a scatter plot in which each marker is colored according to the value of the Origin variable, which has the values "Asia," "Europe," and "USA." The second scatter plot uses the WHERE statement to restrict the display to vehicles that were manufactured in the USA. For simplicity, only the most important statements are shown. You can download the complete SAS program that generates these graphs.

title "Vehicles: All Origins"; 
proc sgplot data=sashelp.cars;
  scatter x=wheelbase y=weight / markerattrs=(symbol=CircleFilled) group=origin;
run;
 
title "Vehicles: Origin=USA only";
proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=(symbol=CircleFilled);
run;

The graphs look the way they do because the SG procedures follow certain rules for displaying grouped versus non-grouped data. The first graph has three groups. The GROUP= option causes the SGPLOT procedure to assign the ODS element GraphData1 to the first group of markers ("Asia"). Similarly, the ODS element GraphData2 is assigned to the second group ("Europe") and the GraphData3 element is assigned to the third group ("USA"). (These names are described in the documentation for ODS style elements.) In the HTMLBlue style, which is the style used for these graphs, each marker is colored a shade of blue, red, or green according to its group membership.

The second plot shows only the "USA" vehicles. Whereas these markers were colored green in the first plot, they are colored blue in the second plot. This shade of blue comes from the GraphDataDefault element. The marker color is GraphDataDefault:ContrastColor, the marker size is GraphDataDefault:MarkerSize, and the marker symbol is GraphDataDefault:MarkerSymbol.

If you are displaying both graphs, you would certainly want the markers in the second graph to use the same shade of green as the "USA" vehicles in the first graph. This shade of green does not have a standard name, so you cannot specify a hard-coded color such as MARKERATTRS=(color=DarkGreen). Instead, use the MARKERATTRS= option to specify the marker attributes in the GraphData3 element, as follows:

proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=GraphData3;
run;

The MARKERATTRS= option tells the scatter plot to use markers that have the color, symbol, and size of the GraphData3 style element. For the default HTMLBlue style, that means dark green and open circles, as shown.

The second plot now looks similar to the "USA" vehicles in the first plot. Using the GraphData3 style changed the color, but the symbol is incorrect. To override an attribute of a style element, specify the attribute in parentheses AFTER the style element. For this example, specify the SYMBOL= option in parentheses, as follows:

proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=GraphData3(symbol=CircleFilled);
run;

Voila! The task is complete. The second plot now has the same marker attributes as the first plot. (View the plots side by side.) If you were to change the ODS style from HTMLBlue to some other style, the colors would still match.

Notice that PROC SGPLOT does not support specifying the color by using the name of the style element. For example, the following is invalid in PROC SGPLOT:

scatter x=x y=y / markerattrs=(GraphData3:ContrastColor); /* NOT supported */

The Graph Template Language (GTL) supports specifying style elements using the "colon notation," but that syntax is not supported in the SG procedures.
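
For example, the following GTL sketch (the template name is illustrative) uses the colon notation to request the contrast color of the GraphData3 element:

proc template;
define statgraph ColonDemo;
begingraph;
   layout overlay;
      scatterplot x=wheelbase y=weight /
         markerattrs=(color=GraphData3:ContrastColor symbol=CircleFilled);
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=sashelp.cars(where=(origin="USA")) template=ColonDemo;
run;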

The SAS documentation contains sections that explain ODS style elements and how to use them in the SG procedures.

As indicated above, the same ideas apply to line attributes, region attributes, and so forth. For example, in a SERIES statement you can use LINEATTRS=GraphFit2(thickness=3) to assign the attributes of the GraphFit2 style and override the THICKNESS attribute.
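
For example, the following short sketch uses the Sashelp.Stocks data, chosen here only for illustration:

proc sgplot data=sashelp.stocks;
   where stock="IBM";
   series x=date y=close / lineattrs=GraphFit2(thickness=3);
run;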

Thanks to Sanjay Matange and Dan Heath for several discussions that improved my understanding of ODS style elements and attributes.


Generate points uniformly inside a d-dimensional ball

Last week I showed how to generate random points uniformly inside a 2-D circular region. That article showed that the distance of a point to the circle's center cannot be distributed uniformly. Instead, you should use the square root of a uniform variate to generate 2-D distances to the origin. This result generalizes to higher dimensions. If you want to simulate points uniformly in the d-dimensional ball, generate the radius to be proportional to the dth root of a uniform random variate.

Although it is possible to generalize the polar mapping and obtain a parameterization of the d-dimensional ball, there is an easier approach to simulate points in a ball. The basic idea is to simulate d independent standardized normal variables, project them radially onto the unit sphere, and then adjust their distance to the origin appropriately. You can find this algorithm in many textbooks (Devroye, 1986; Fishman, 1996), but Harman and Lacko (2010) summarize the process nicely. To paraphrase:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit d-sphere. Multiplying S by U^(1/d), where U has the uniform distribution on the unit interval (0,1), creates the uniform distribution in the unit d-dimensional ball.

It is easy to carry out this operation in SAS. In Base SAS, you can write a DATA step that uses arrays. The following program provides a SAS/IML solution. I used SAS/IML Studio to create a 3-D rotating scatter plot of the simulated cloud of points. You can download the SAS program that contains the DATA step and the IMLPlus program that creates the animation.

proc iml;
call randseed(12345);
d = 3;                               /* dimension = number of variables */
N = 1000;                            /* sample size = number of obs     */
radius = 2;                          /* radius of the ball */
Y = randfun(N // d, "Normal");       /* Y ~ MVN(0, I(d)) */
u = randfun(N, "Uniform");           /* U ~ U(0,1)       */
r = radius * u##(1/d);               /* r proportional to d_th root of U */
X = r # Y / sqrt(Y[,##]);            /* Y[,##] is sum of squares for each row */
/* X contains N random uniformly distributed points in the d-ball */
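
For comparison, the following Base SAS DATA step is a minimal sketch of the same algorithm; it hard-codes d = 3 and uses an illustrative data set name:

data Ball3D(keep=x1-x3);
call streaminit(12345);
d = 3;  radius = 2;  N = 1000;
array x[3];                            /* coordinates of one point */
do i = 1 to N;
   normSq = 0;
   do j = 1 to d;
      x[j] = rand("Normal");           /* independent N(0,1) components */
      normSq + x[j]**2;                /* accumulate sum of squares */
   end;
   u = rand("Uniform");
   r = radius * u**(1/d);              /* radius proportional to d_th root of U */
   do j = 1 to d;
      x[j] = r * x[j] / sqrt(normSq);  /* project onto sphere, then scale */
   end;
   output;
end;
run;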

The WHERE clause in SAS/IML

In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements.

  • On the USE statement, the WHERE clause acts as a global filter. The WHERE clause applies to all subsequent READ statements that read from the open data set.
  • On the READ statement, the WHERE clause further filters the data. Because you can use multiple READ statements, you can easily create matrices that contain disjoint or overlapping subsets of the data. However, be aware that multiple READ statements might result in SAS reading the data multiple times, depending on the syntax that you specify.

An interesting fact about the WHERE clause in SAS/IML is that you can specify run-time expressions for the WHERE clause, which makes it a very powerful tool for data analysis.

The WHERE clause in the USE statement

The USE statement opens a SAS data set for read access. When you need only one filter, specify a WHERE clause on the USE statement. For example, suppose that you want a matrix that contains the age, weight, and height of all females in the Sashelp.Class data. The following program reads the female observations into the matrix X and prints the average age, weight, and height:

proc iml;
varNames = {"Age" "Weight" "Height"};
use Sashelp.Class where(sex='F');       /* restrict to females */
   read all var varNames into X;
close Sashelp.Class;
 
avg = mean(X);
print avg[L="Mean Values for Females" colname=varNames format=5.2];

The WHERE clause in the READ statement

You can also put the WHERE clause in the READ statement. This technique is useful if you intend to read the data several times. For example, the following program reads data for females into the X matrix and data for males into the Y matrix:

use Sashelp.Class;
   read all var varNames into X where(sex='F');
   read all var varNames into Y where(sex='M');
close Sashelp.Class;

If you use a WHERE clause on both the USE and READ statements, the data filter combines both WHERE clauses by using a "logical AND" operator. The SAS log includes the following NOTE to inform you:

NOTE: WHERE clause has been augmented
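
For example, the following brief sketch applies a filter on both statements (the age filter is arbitrary); the matrix Z contains the females who are older than 12:

use Sashelp.Class where(sex='F');
   read all var varNames into Z where(age > 12);   /* combined filter: sex='F' AND age>12 */
close Sashelp.Class;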

Expressions in the WHERE clause

Beginning with SAS/IML 13.1 (released with SAS 9.4m1), you can use expressions in WHERE clauses. This means that you can call the READ statement in a loop and read and analyze a different subset of the data during each iteration.

For example, suppose that you have several grouping variables and you want to conduct a BY-group analysis. You can use the UNIQUEBY technique to conduct a BY-group analysis with several variables. However, the UNIQUEBY technique requires that the data be sorted and fit in RAM. It also requires a bit of "bookkeeping" because you need to keep track of indices. If you don't mind the inefficiency of reading the data multiple times, a WHERE clause approach is conceptually easier to program.

As an example, suppose that you want to analyze the MPG_City variable in the Sashelp.Cars data set for each combination of the Origin and Type variables. To keep it simple, suppose that you want to compute the mean value of MPG_City for all pairwise combinations of the Origin and Type variables, excluding the observations for American-made vehicles. This analysis is simple by using PROC MEANS. (The output for PROC MEANS is not shown.)

proc means data=Sashelp.Cars Mean;
where Origin ^= "USA";
   class Origin Type;
   var MPG_City;
run;

In PROC IML, this computation requires looping over the valid combinations of Origin and Type. To make the analysis simpler, the following call to PROC FREQ writes the valid combinations to a SAS data set:

proc freq data=Sashelp.Cars noprint;
where Origin ^= "USA";
tables Origin*Type / nocum norow nocol nopercent
                     out=FreqOut;  /* unique combinations of Origin and Type */
run;

In PROC IML, you can read the FreqOut data to obtain the unique combinations of the Origin and Type variables. You can iterate over these combinations, reading the Sashelp.Cars data multiple times. During each iteration, you can analyze one of the BY groups, as follows:

proc iml;
use FreqOut;
   read all var {Origin Type};           /* read unique levels of BY groups */
close FreqOut;
NumGroups = nrow(Origin);
 
use Sashelp.Cars where(Origin ^= "USA"); /* open data set for reading */
Stats = j(NumGroups, 2);                 /* allocate matrix for results */
do i = 1 to NumGroups;                   /* for each BY group... */
   /* read unsorted data to obtain the i_th BY group */
   /* Notice the EXPRESSIONS in the WHERE clause! */
   read all var {MPG_City} where(origin=(Origin[i]) & type=(Type[i])); 
   Stats[i, 1] = countn(MPG_City);       /* number of nonmissing obs */
   Stats[i, 2] = mean(MPG_City);         /* analyze this BY group */
end;
close Sashelp.Cars;
 
print Origin Type Stats[colname={"N" "Mean"}];

The result of the analysis is similar to the output from PROC MEANS. Notice the use of expressions in the WHERE clause in the READ statement. The expression origin=(Origin[i]) is interpreted as follows:

  • The left side of the equal sign (origin) specifies the name of a variable in the open data set.
  • The right side of the equal sign must be enclosed in parentheses unless it is a literal constant.
  • The expression inside the parentheses can be any matrix computation that results in a scalar value, including calls to built-in or user-defined functions.

The example program reads the data set 10 times, once for each unique combination of Origin and Type. Although re-reading data is inefficient, there are three advantages: the data set does not need to be sorted, only one BY group at a time is ever in RAM, and the program statements are easy to write. By using this method, you do not have to keep track of sorting, indexing, or extracting the data. The WHERE clause in SAS/IML does the work for you.


Generate points uniformly inside a circular region in 2-D

It is easy to generate random points that are uniformly distributed inside a rectangle. You simply generate independent random uniform values for each coordinate. However, nonrectangular regions are more complicated. An instructive example is to simulate points uniformly inside the ball with a given radius. The two-dimensional case is to generate random points uniformly inside the planar region that contains points within a certain distance to the origin. Mathematicians call this region the two-dimensional ball or disk; others just say "inside a circle."

To generate data uniformly in any planar region, you can use the acceptance-rejection algorithm: You generate points uniformly in a bounding rectangle and then reject and discard any values that are outside the region of interest. This method always works, but it is inefficient in high dimensions (and for regions with small volumes) because most points that you generate get rejected. This article shows how to directly generate points with uniform density inside a circular region.

How NOT to simulate data in a disk

If you are familiar with polar coordinates, you might think that this problem is simple. The function
(r, θ) → (r cos θ, r sin θ)
parameterizes a circular region in terms of radii and angles. As a first attempt, you might try to simulate uniform values of a radius and angle, and then use the polar equations to convert to Euclidean coordinates, as shown in the following SAS DATA step:

/* This DATA step is WRONG; the points are not uniform */
data Ball1(drop=radius twopi);
call streaminit(12345);
radius = 2;                           /* use circle of radius 2 */
twopi = 2 * constant("PI");
do i = 1 to 1000;                     /* simulate 1000 points */
   theta = twopi * rand("uniform");   /* angle is uniform */
   r = radius* rand("uniform");       /* radius is uniform : THIS IS WRONG! */
   x = r*cos(theta);
   y = r*sin(theta);
   output;
end;
run;

Unfortunately, this simple approach is not correct. The points are uniformly distributed in the (r, θ) coordinates, but not in the (x, y) coordinates. The following scatter plot displays the (x,y) points and the circle of radius 2.

[Figure: Random points in a disk]

It is apparent from the image that there are more points near the origin than further away from the origin. The points are not uniformly distributed.

After seeing this image, it's not hard to realize the mistake in the simulation. When we say that points should be "uniformly distributed," we mean that the probability of generating a point in any finite region is proportional to the area of that region. However, the previous program does not consider area, only angles and radii.

The area of a circle with radius r is πr², which is proportional to the square of the radius. However, the program generates points inside a circle of radius r with a probability that is proportional to r. To be correct, the probability should be proportional to the square of r.
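
Here is a short derivation that shows why the square root is the correct transformation. Let R be the radius of the disk and let F(ρ) = P(r ≤ ρ) be the probability that a uniformly distributed point is within distance ρ of the origin. Then F(ρ) = πρ² / (πR²) = (ρ/R)². If U is uniform on (0,1) and you define r = R√U, then P(r ≤ ρ) = P(U ≤ (ρ/R)²) = (ρ/R)², which is exactly the required distribution. This computation motivates the fix in the next section.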

How to simulate data in a disk

You can adjust the probability by changing the distribution of the r variable. Instead of sampling r uniformly, generate r as the radius times the square root of a uniform variate (Devroye, 1986, p. 236). This transformation decreases the likelihood that points are generated near the origin and increases the probability that points are generated near the edge of the circular region. Only one statement in the DATA step needs to change. The following SAS DATA step is the correct way to simulate points uniformly in a disk:

/* correct way to generate points uniformly in a 2-D circular region (ball) */
data Ball(drop=radius twopi);
call streaminit(12345);
radius = 2;                                /* use circle of radius 2 */
twopi = 2 * constant("PI");
do i = 1 to 1000;                          /* simulate 1000 points */
   theta = twopi * rand("uniform");        /* angle is uniform */
   r = radius * sqrt( rand("uniform") );   /* radius proportional to sqrt(U), U~U(0,1) */
   x = r*cos(theta);
   y = r*sin(theta);
   output;
end;
run;
[Figure: Random points distributed uniformly in a disk]

The new graph of the simulation shows that the points are uniformly distributed within the circular region. You can use PROC KDE to verify that the density inside the circle is approximately constant and equal to 1/(4π), which is 1 divided by the area of the circular region.
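
A minimal sketch of that check, leaving the bandwidth at its default, is:

ods graphics on;
proc kde data=Ball;
   bivar x y / plots=(contourscatter);   /* density contours with the data overlaid */
run;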

In a future post, I will show how to generalize this example and simulate random points uniformly within a d-dimensional ball for any d ≥ 2.


Save descriptive statistics for multiple variables in a SAS data set

Descriptive univariate statistics are the foundation of data analysis. Before you create a statistical model for new data, you should examine descriptive univariate statistics such as the mean, standard deviation, quantiles, and the number of nonmissing observations.

In SAS, there is an easy way to create a data set that contains the descriptive statistics for every numerical variable in your data: use the OUTTABLE= option in PROC UNIVARIATE. It doesn't matter if your data has 5 variables or 5,000 variables. That one option writes dozens of statistics for all numerical variables in the data!

PROC MEANS: A great way to display statistics on the screen

First, some basic background about viewing descriptive statistics. Many people learned that PROC MEANS is the standard way to display descriptive statistics on your computer screen. You specify the statistics that you want on the PROC MEANS statement, as follows:

/* can produce 38 descriptive statistics */ 
proc means data=sashelp.cars nolabels
           N Mean Std Min Q1 Median Q3 Max;   /* specify the statistics */
run;       /* omit VAR statement to analyze all numerical variables */

Notice that by omitting the VAR statement, the procedure analyzes all numerical variables. The output is an ODS table named "Summary."

That was easy. However, suppose that you do not care about seeing the statistics on your computer monitor, but you want them written to a SAS data set. Because you are analyzing many variables and want many statistics, it is somewhat awkward to use the OUTPUT statement in PROC MEANS. An easier way is to use ODS to write the statistics to a data set by doing the following:

  • Use the ODS EXCLUDE statement to suppress output to the monitor.
  • Specify the statistics that you want to see: the mean, standard deviation, median, and so forth.
  • Specify the STACKODSOUTPUT option on the PROC MEANS statement to tell the procedure to preserve the tabular form of the statistics table.
  • Use the ODS OUTPUT statement to write the "Summary" table to a SAS data set.

The solution now requires several lines of typing. It's not onerous, but it is a lot to remember:

ods exclude all;                  /* suppress display to open ODS destinations */
proc means data=sashelp.cars 
           N Mean Std Min Q1 MEDIAN Q3 Max  /* type every statistic you want */
           STACKODSOUTPUT;        /* preserve table form of output */
ods output Summary=MeansSummary;  /* write statistics to data set */
run;
ods exclude none;                 /* restore normal display of tables */

The data set MeansSummary contains the statistics for every numerical variable in the original data. This solution requires using ODS statements to direct the output to a data set while suppressing it from appearing on the screen. It is a nice trick in general, but there is an easier way for this particular task.

PROC UNIVARIATE and the OUTTABLE= option

I've been using SAS for 20 years, but somehow I never learned about the OUTTABLE= option in PROC UNIVARIATE until last week. While reading the documentation I learned that the OUTTABLE= option creates an output data set that contains 46 descriptive statistics for each numerical variable on the VAR statement.

If you omit the VAR statement, PROC UNIVARIATE analyzes all numerical variables. Therefore, by specifying the OUTTABLE= option, you can duplicate the PROC MEANS example without using ODS. As an extra bonus, you do not have to type the names of any statistics.

The following statement computes 46 statistics for each numerical variable in the input data set. The data set UniSummary contains the results. The NORMAL option tells PROC UNIVARIATE to include the Kolmogorov-Smirnov test for normality in the statistics that it creates:

proc univariate data=sashelp.cars outtable=UniSummary NORMAL noprint;
run;            /* omit VAR statement to analyze all numerical variables */
 
proc contents data=UniSummary short;
run;

By default, PROC UNIVARIATE produces about two pages of tables for each numerical variable, but the NOPRINT option suppresses the display of tables. The output from PROC CONTENTS shows the names of variables in the output data set. Each name represents a statistic, such as CSS (corrected sum of squares), CV (coefficient of variation), and GEOMEAN (geometric mean). The names begin and end with an underscore. The following statements print the same summary statistics that PROC MEANS produced; the output is not shown.

proc print data=UniSummary label;
   var _var_ _label_ _NObs_ _Mean_ _Std_ _Min_ _Max_;
run;

For both procedures, you can use the CLASS statement to obtain statistics for subgroups of the data. For example, you can include the statement CLASS origin to obtain summary statistics for each variable and grouped according to whether a vehicle was manufactured in Europe, Asia, or the USA.
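
For example, the following short sketch computes the OUTTABLE= statistics separately for each value of the Origin variable; the output data set name is arbitrary:

proc univariate data=sashelp.cars outtable=UniSummaryByOrigin noprint;
   class origin;
run;            /* omit the VAR statement to analyze all numerical variables */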

It is worth noting that PROC UNIVARIATE has options (for example, WINSORIZED= and TRIMMED=) that compute statistics that are not saved to the OUTTABLE= data set. However, you can use ODS to capture those statistics.
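
For example, the following brief sketch requests 10% trimmed and Winsorized means and captures them through ODS. I believe the ODS table names are TrimmedMeans and WinsorizedMeans; you can confirm the names by using ODS TRACE ON:

proc univariate data=sashelp.cars trimmed=0.1 winsorized=0.1;
   var MPG_City;
   ods output TrimmedMeans=TrimStats WinsorizedMeans=WinsorStats;
run;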

In summary, this article shows how to create a data set for descriptive statistics by using PROC MEANS or PROC UNIVARIATE. You can use either procedure, but using the OUTTABLE= option in PROC UNIVARIATE saves you some typing. The OUTTABLE= data set also includes some robust statistics that are not available in PROC MEANS.
