Lo, how a polar rose e'er blooming

Lo how a rose e'er blooming
From tender stem hath sprung

As I write this blog post, a radio station is playing Chrismas music. One of my favorite Christmas songs is the old German hymn that many of us know as "Lo, How a Rose E're Blooming." I was humming the song recently when someone asked me how to use SAS to draw curves in polar coordinates. I immediately thought about "polar roses," mathematical parametric curves of the form r = cos(k θ).

In the spirit of Christmas, I present "Lo, how a polar rose e'er blooming, from SGPLOT hath sprung."

Plotting polar coordinates in SAS

It is easy to graph a polar curve in SAS. Here's a quick overview, or a "polar express" tour, if you prefer.

Many familiar polar equations are parametric curves. The simplest are relations are of the form r = f(θ), where θ is the polar angle. The following SAS DATA step creates evenly spaced values of theta in the range [0, 2π]. For each value of theta, it computes r = cos(k θ), which is the equation for a rose with k petals. The polar coordinates (r, θ) are converted to Euclidean coordinates through the usual transformation x = r*cos(theta) and y = r*sin(theta).

After creating the points along the curve, you can visualize it by using PROC SGPLOT. The SERIES statement is used to display the parametric curve in Euclidean coordinates. The XAXIS and YAXIS statements use the MIN= and MAX= options to ensure that the image appears in a square region, and the ASPECT=1 option is used to ensure that the aspect ratio of the plot does not distort the geometry of the rose.

%let k = 5;
/* draw the curve r=cos(k * theta) in polar coords, which is a k-leaved rose */
data Rose;
do theta = 0 to 2*constant("pi") by 0.01;
   r = cos(&k * theta);       /* rose */
   x = r*cos(theta);          /* convert to Euclidean coords */
   y = r*sin(theta);
   output;
end;
run;
 
title "Polar Rose: r = cos(&k theta)";;
proc sgplot data=Rose aspect=1;
   refline 0 / axis=x;
   refline 0 / axis=y;
   series x=x y=y;
   xaxis min=-1 max=1;
   yaxis min=-1 max=1;
run;
polarrose

Graphing generalized roses

The Wikipedia page for the polar rose contains a brief discussion about what the curve r = cos(k θ) looks like when k is a rational number n/d. When k is not an integer, most of the curves do not look like a rose—or any other kind of flower! Only a myopic mathematician would call some of these images roses.

Regardless, let's see how to get some of these "roses" to bloom in SAS. The following DATA step generates all positive rational numbers of the form n/d, where n ≤ 7 and d ≤ 9 are integers. For each value of k, the program computes the generalize rose equation and converts the values from polar to Euclidean coordinates. (Because I want to color the roses red and green for Christmas, I will also include a binary variable that will determine the color.)

data Roses;
do n = 1 to 7;
   do d = 1 to 9;
      k = n / d;
      /* draw the rose r=cos(n/d * theta) */
      group = mod(n+d, 2);        /* optional: use for color */
      do theta = 0 to 2*lcm(n,d)*constant("pi") by 0.1;
         r = cos(k * theta);      /* generalized rose */
         x = r*cos(theta);        /* convert to Euclidean coords */
         y = r*sin(theta);
         output;
      end;
   end;
end;
run;

It is not any harder to visualize 63 roses than it is to visualize one. The SGPANEL procedures was designed to display multiple cells, each with the same type of graph. Therefore, to display an entire bouquet of roses, you merely have to specify the N and D variables on the PANELBY statement, and SAS handles the rest:

title "Polar Roses: r = cos(n/d theta)";
proc sgpanel data=Roses aspect=1 noautolegend;
   styleattrs datacontrastcolors=(crimson green);
   panelby n d / layout=lattice onepanel rowheaderpos=left;
   refline 0 / axis=x transparency=0.5;
   refline 0 / axis=y transparency=0.5;
   series x=x y=y / group=group;
   colaxis display=none;
   rowaxis display=none;
run;
polarrose2

You can see certain numeric patterns in the lattice display. In particular, when n/d is a reducible fraction (such as 2/4, 3/6, and 4/8), the image is identical to the reduced fraction (such as 1/2). Do you see other patterns? Leave a comment.

These roses are beautiful, like the song that inspired me to create them. Since I get up early in the morning to write my blog, a few additional lyrics seem appropriate:

It came a blossom bright.
Amid the cold of winter
When half spent was the night.

Happy holidays to all my readers. I am taking a short vacation and will return in 2016. Until then, keep your eyes open: math, statistics, and beauty are all around you, ever blooming!

Post a Comment

Running interactive procedures in SAS Studio and SAS University Edition

The most recent development environment for SAS programmers is SAS Studio, which is a browser-based application. The free SAS University Edition, which includes SAS/IML software, also uses SAS Studio as a development environment.

SAS Studio has a special mode for programmers who use interactive procedures such as PROC IML. (Recall that the "I" in IML stands for "interactive.") An interactive procedure in SAS enables you to submit multiple statements without exiting the procedure. For example, if you submit SAS/IML statements that define some vectors, those vectors persist and are available for the subsequent submissions until you submit the QUIT statement. (Remember: SAS/IML does not require a RUN statement to execute the submitted statements.) Other fully interactive procedures include PROC SQL and PROC OPTMODEL.

Some SAS procedures are partly interactive. Procedures such as PROC REG and PROC GLM support RUN-group processing. For these procedures, the RUN statement defines blocks of statements that get executed, but the procedure does not exit until it encounters a QUIT statement.

Setting interactive mode in SAS Studio

It is easy to put SAS Studio into interactive mode, which supports both fully and partially interactive procedures. Towards the right side of the main row of icons is an icon whose tool tip is "Go interactive." The icon is shown below:

gointeractive

When you click that icon, SAS Studio enters "interactive mode." In interactive mode, SAS Studio does not automatically submit a QUIT statement after each submission. This is very important to programmers who use interactive procedures. In the past, I have been unable to use SAS Enterprise Guide because EG does not have an interactive mode. Prior to SAS 9.4m1 when SAS Studio implemented interactive mode, I mostly used the SAS Windowing environment (display manager) to develop complex SAS programs that include calls to PROC IML. Now I have the option to use SAS Studio.

Setting interactive mode to be the default mode

You can set interactive mode to be the default mode in SAS Studio. Click on the Options icon and select Preferences from the drop-down menu. The Options icon looks like the following:

gointeractive2

When you click the Options icon, the Preferences dialog box appears. On the General tab, check the Start new programs in interactive mode field. Click Save. Future code windows will be in interactive mode by default.

gointeractive3

As a bonus, in interactive mode SAS Studio appends output to the SAS log and to the results window. Iprefer this behavior over the default behavior, which is to clear the output and log each time that you submit new statements.

As I've said before, I am excited that SAS/IML is included in the free SAS University Edition. I am also excited that the SAS Studio interface provides an easy way to enable interactive mode, so that you can enjoy using the interactive capabilities of PROC IML and other interactive procedures.

Post a Comment

Overlay categories on a histogram

Recently Sanjay Matange blogged about how to color the bars of a histogram according to a gradient color ramp. Using the fact that bar charts and histograms look similar, he showed how to use PROC SGPLOT in SAS to plot a bar chart in which each bar is colored according to a gradient color ramp. Essentially, Sanjay used a bar chart to simulate a histogram.

histblock1

This article discusses a different, but related, issue: overlay a sequence of discrete colored blocks on a histogram. The typical use case is to display categories that are derived from the histogram's variable.

For example, the histogram at the left shows the distribution of cholesterol levels for more than 5,000 patients in the Framingham Heart Study, as recorded in the data set Sashelp.heart. Behind the histogram are three colored regions that indicate three medically significant intervals: desirable levels of cholesterol (less than 200 mg/dL), borderline levels (up to 240 mg/dL), and high levels (240 mg/dL or greater). A sequence of colored blocks (green, yellow, and red) are drawn to visually indicate the cholesterol ranges that are of interest.

This same technique applies to many other examples. For example, you can graph the distribution of student grades and display the intervals for the categories for "A," "B", "C," "D," and "F." Or you can plot the wind speeds of a tropical cyclone and display the Saffir-Simpson categories for wind speed such as "Tropical Depression," "Tropical Storm", "Category 1," and so forth.

Visualizing discrete categories

How can you create a histogram with colored blocks? In SAS, you can use the BLOCK statement in PROC SGPLOT or use the BLOCKPLOT statement in the SAS Graph Template Language (GTL) to create the colored regions. Unfortunately, PROC SGPLOT does not allow you to overlay a histogram and a block plot, so this article uses the GTL.

To create a block plot you need two variables: the continuous variable (Cholesterol) and a discrete variable that associates each cholesterol value with a category. The Sashelp.Heart data contains a discrete variable named Chol_Status which has the value "Desirable" if a patient's cholesterol is less than 200 mg/dL. It has the value "Borderline" if the cholesterol is 200 or more, but less than 240 mg/dL. It has the value "High" if the patient's cholesterol is 240 or more. You can create a block plot by using Cholesterol as the X variable and by using Chol_Status as the "Block" variable. In order to create a block plot in SAS, the data must be sorted in ascending order by the continuous variable (Cholesterol).

Notice that I am not attempting to color the histogram bars. In general, the cut points that determine the categories are not evenly spaced, so it is not always possibly to align the cut points with the endpoints of the histogram bins.

I used this technique in Wicklin (2009) to visualize the distribution of a response variable that is used to color a heat map or choropleth map. In that paper, quantiles of a response variable are used to color a heat map. Each heat map is accompanied by a histogram of a response variable and a colored bar that shows quantiles of the response. I did not originate the idea; it was used very effectively in Pickle et al. (1996) Atlas of United States Mortality.

Overlay the histogram and block plot

The easiest way to overlay a histogram and block plot is to use the LAYOUT OVERLAY statement in the GTL. If you do not specify any colors, then each regions is colored by using the GraphData1, GraphData2,..., colors of the current ODS style. In most cases you will want to specify the colors by using a sequential or diverging palette of colors. For this example, I have hard-coded the colors to be a traffic-light scheme, which many people use to convey the idea that low values of cholesterol are good (green), intermediate values should cause an alert (yellow), and high values are dangerous (red).

The following GTL statements create a template called BlockHist. The template uses dynamic variables so that at run time you can specify the title and the names of the continuous and block variables. The height of the vertical axis is extended by 10% to make room for the labels of the block variable. If you use this template to display k categories, you should replace the DATACOLORS= option with a color palette that has k colors.

/* overlay histogram on block plot */
proc template;
define statgraph BlockHist;                   /* name of template */
dynamic _X _Block _Title;                     /* dynamic variables */
  begingraph / datacolors=(CX91CF60 CXFEE08B CXD73027);  /* green, yellow, red for this example */
  entrytitle _Title;                          /* specify title at run time (optional) */
  layout overlay / yaxisopts=(offsetmax=0.1); /* add 10% space at top for labels */
     blockplot x=_X block=_Block / display=(fill values)   /* categories */
              valuehalign=center valuevalign=top;
     histogram _X;                       /* optional: fillattrs=(color=white transparency=0.25) */
  endlayout;
endgraph;
end;
run;
 
/* sort data to use BLOCKPLOT statement */
proc sort data=sashelp.heart out=heartSorted; 
   by Cholesterol; 
run;
 
proc sgrender data=heartSorted template=BlockHist;
where Cholesterol <= 400;
   dynamic _X='Cholesterol' _Block='Chol_Status' _Title="Distribution of Cholesterol";
run;

The graph is shown at the top of this article.

An alternative design: Adding a colored strip

To my eye, the previous graph has too much color. The eye is drawn to the background rather than to the histogram of the data. An alternative design is to display a narrow strip of color underneath the histogram, which is the method used by Pickle et al. The following GTL template uses the INNERMARGIN statement to create a block plot underneath a histogram. Again, you should modify the DATACOLORS= option if your block variable has more than three colors.

/* put narrow strip that shows categories below the histogram */
proc template;
define statgraph BlockHistStrip;    /* name of template */
dynamic _X _Block _Title;           /* dynamic variables */
  begingraph / datacolors=(CX91CF60 CXFEE08B CXD73027);    /* green, yellow, red for this example */
  entrytitle _Title;                /* specify title at run time (optional) */
  layout overlay;
      histogram _X / binaxis=false;
      innermargin / pad=(top=2);   /* place BLOCKPLOT in the INNERMARGIN */
         blockplot x=_X block=_Block / display=(fill values)
                   valuehalign=center valuevalign=top;
      endinnermargin;
   endlayout;
   endgraph;
end;
run;
 
ods graphics / width=640px height=320px;
proc sgrender data=heartSorted template=BlockHistStrip;
where Cholesterol <= 400;
   dynamic _X='Cholesterol' _Block='Chol_Status' _Title="Distribution of Cholesterol";
run;
histblock2

In summary, you can use the BLOCKPLOT statement in the GTL to visualize categories that are derived from a continuous variable. You can create a block plot behind the histogram, or you can display a narrow strip of color below the histogram. In either case, the bars of the block plot indicate intervals of a continuous variable that define discrete medical or scientific categories. You can specify the colors of the block plot by using a sequential or diverging color palette, thereby linking the histogram variable to a heat map or choropleth map that is colored by discrete categories.

Post a Comment

Why doesn't PROC UNIVARIATE support certain common distributions?

A SAS customer asked:

Why isn't the chi-square distribution supported in PROC UNIVARIATE?

That is an excellent question. I remember asking a similar question when I first started learning SAS. In addition to the chi-square distribution, I wondered why the UNIVARIATE procedure does not support the F distribution. These are common distributions that appear often in statistics textbooks. Why aren't they among the list of distributions that PROC UNIVARIATE can fit?

There are three ways to address the question. One is philosophical, the other graphical, the third is computational. (Spoiler Alert: PROC UNIVARIATE actually does support fitting the chi-square distribution!)

Data distributions versus sampling distributions

The HISTOGRAM statement in UNIVARIATE procedure can fit many continuous parametric distributions to observed data. It supports standard distributions such as the exponential, lognormal, normal, and Weibull distributions, as well as some less-common distributions such as the bounded and unbounded Johnson distributions. Other SAS procedures can fit additional distributions. So why doesn't UNIVARIATE support the chi-square and F distributions?

The philosophical answer is that PROC UNIVARIATE is designed to model distributions of observed data, whereas the chi-square and F distributions arise naturally as the sampling distribution of a statistic.

When you model real data (size, weight, pressure, time-to-failure, and so forth), you should choose a modeling distribution that plausibly describes the population from which a random sample was drawn. This often leads to models such as the normal, lognormal, Weibull, and so forth. It usually doesn't make sense to say, "I think I'll model these data by using a chi-square distribution." The chi-square and F distributions describe (asymptotic) sampling distributions of statistics. We don't usually fit their parameters; the parameters represent degrees of freedom and are determined by the sample size, the number of groups, and similar considerations.

Consequently, the philosophical answer is that PROC UNIVARIATE does not support these distributions because they are not common models for observed data.

Overlaying a distribution on a histogram

There is another way to interpret the question. Many statistical programmers use PROC UNIVARIATE as an easy way to overlay a parametric density curve on a histogram. Therefore the programmer might be asking the practical question, "how can I overlay a chi-square (or F) distribution on a histogram." For example, in a simulation study, you might want to overlay an asymptotic density curve on a histogram of Monte Carlo estimates.

I have previously described how to overlay a custom density curve on a histogram. All that is required is the ability to evaluate the probability density function (PDF) for the distribution. The PDF function in SAS can evaluate the chi-square and F distributions, so it is straightforward to overlay these distributions on a histogram.

Fitting a chi-square distribution to data

There is a third answer to the question. The chi-square distribution with d degrees of freedom is equivalent to a gamma(d/2, 2) distribution. PROC UNIVARIATE supports fitting the gamma distribution, so you can actually fit a chi-square model as a special case of the gamma distribution. As mentioned in the first section, you probably wouldn't do this for real data, but you might do it as part of a simulation study.

The following program simulates data from a chi-square distribution with 5 degrees of freedom, then fits a gamma density to the data. Use the SCALE=2 option to restrict the gamma family to the chi-square family. Use SHAPE=EST to compute a maximum likelihood estimate of the shape parameter, as follows:

data ChiSq(keep=xChi2);
call streaminit(54321);
do i = 1 to 100;
   xChi2 = rand("chisq", 5);  /* 5 degrees of freedom */
   output;
end;
run;
 
/* use fact that chiSquare(d) = gamma(d/2, 2) to fit chi-square to data */
proc univariate data=ChiSq;
   histogram xChi2 / gamma(shape=est scale=2);  /* fix scale=2, estimate shape parameter */
run;
chi2fit

The estimate of the shape parameter is 2.59. Double this value to obtain an estimate for the degree-of-freedom parameter, which results in the estimate d = 5.18. This is close to the parameter value (5) that was used to simulate the data.

By the way, you can use a similar trick to fit a uniform distribution by using PROC UNIVARIATE. The uniform distribution is a sub-family of the beta distribution, with ALPHA=1 and BETA=1.

As far as I know, there is no way to use PROC UNIVARIATE to fit the F distribution. In theory, you can use maximum likelihood estimation in SAS/IML or some other SAS procedure to obtain estimates for the degree-of-freedom parameters. However, I have never done that, so I don't know how hard it is. If anyone has a reference (or program!) to fit an F distribution to data, post a comment.

In summary, there are valid philosophical reasons why PROC UNIVARIATE doesn't support fitting certain distribution to observed data. However, with a little technical know-how, you can actually trick UNIVARIATE into fitting a chi-square distribution. But if your goal is to overlay a density curve on a histogram, you can do that without using PROC UNIVARIATE.

Post a Comment

Arrange matrices and graphs in a gridded layout

Last week my colleague Chris Hemedinger published a blog post that described how to use the ODS LAYOUT GRIDDED statement to arrange tables and graphs in a panel. The statement was introduced in SAS 9.4m1 (December 2013). Gridded layout is supported for HTML, POWERPOINT, and the PRINTER family of destinations (PDF, PS, and PCL). It is not supported for other destinations such as RTF or LISTING.

Chris's example arranged the output of ODS objects that were created by different procedures. It turns out that you can also use the ODS LAYOUT GRIDDED statement to arrange tables and graphs that are created by PROC IML, which is an interactive procedure. Because the ODS statement is a "global" statement in SAS, you can call it from within a SAS/IML program to display output horizontally across a page.

For example, suppose that you have two matrices of different sizes, but you want to display them side by side. If you use the PRINT statement, the output doesn't look very pretty in HTML. The PRINT statement essentially makes one big table to hold the two matrices. There is no space between the matrices, and it is hard to tell where one matrix ends and the other begins. The shorter matrix is padded with blank rows, as shown by the following output:
proc iml;
A = {A B C, D E F};
B = {1 2, 3 4, 5 6, 7 8};
print A B;               /* print matrices side by side */
t_layoutgridded1

The output has spaces in the LISTING destination, but I no longer use the LISTING destination, so I am interested in improving the look in HTML and other modern destinations.

The ODS LAYOUT GRIDDED statement solves this problem. You simply specify that ODS should display subsequent tables in a gridded configuration with two columns. You can use the ODS REGION statement every time you want to subsequent output to appear in a new column:
/* gridded layout in HTML, POWERPOINT, and PDF */
ods layout gridded columns=2;
ods region;      /* display subsequent output in 1st column */
   print A;
ods region;      /* display subsequent output in 2nd column */
   print B;
ods layout end;
t_layoutgridded2

The HTML output is shown. The second matrix is positioned horizontally next to the first matrix. Notice that two PRINT statements are used. You can control the horizontal space between matrices by using the COLUMN_GUTTER= option on the ODS LAYOUT GRIDDED statement.

Actually, you can make it even simpler. You can use the ADVANCE=TABLE option on the ODS LAYOUT GRIDDED statement to specify that each table should automatically trigger an advancement to the next column. Consequently, you can omit the ODS REGION calls:

ods layout gridded columns=2 advance=table;
   print A;      /* output in 1st column */
   print B;      /* output in 2nd column */
ods layout end;

The same technique works for horizontally arranging graphical and tabular output. The following statements place a table next to a histogram. If you specify a TITLE statement, put it before the ODS LAYOUT GRIDDED statement.

call randseed(12345);
x = j(100,1);
call randgen(x, "Normal");
/* compute descriptive statistics for sample */
N = countn(x);
mean = mean(x);
std = std(x);
call qntl(q, x, {0.05 0.95});
desc = N // mean // std // q;
statNames = {N, "Mean", "StdDev", "P5", "P95"};
 
ods graphics / width=4in height=2in;
ods layout gridded columns=2 advance=table;
   call histogram(x);
   print desc[L="Descriptive Statistics" r=statNames];
ods layout end;
t_layoutgridded3

As these examples demonstrate, the ODS LAYOUT GRIDDED statement is a powerful way to for a SAS/IML programmer to position output. Although it is not supported by all ODS destinations, it enables you to dynamically arrange output horizontally across the page in HTML and PDF destinations.

Can you think of imaginative ways to use the ODS LAYOUT GRIDDED statement to improve your SAS/IML output? Leave a comment.

Post a Comment

Size matters: Preserving the aspect ratio of the data in ODS graphics

When creating a statistical graphic such as a line plot or a scatter plot, it is sometimes important to preserve the aspect ratio of the data. For example, if the ranges of the X and Y variables are equal, it can be useful to display the data in a square region. This is important when you want to visualize the distance between points, as in certain multivariate statistics, or to visually compare variances. It is also important if you are plotting polygons and want a square to look like a square.

This article presents two ways to create ODS statistical graphics in SAS in which the scale of the data is preserved by the plot. You can use the ASPECT= option in PROC SGPLOT and the OVERLAYEQUATED layout in the Graph Template Language (GTL).

Data scale versus physical measurements of a graph

Usually the width and height of a graph does not depend on the data. By default, the size of an ODS statistical graphic in SAS is a certain number of pixels on the screen or a certain number of centimeters for graphs written to other ODS destinations (such as PDF or RTF). When you request a scatter plot, the minimum and maximum value of each coordinate is used to determine the range of the axes. However, the physical dimensions of the axes (in pixels or centimeters) depends on the titles, labels, tick marks, legends, margins, font sizes, and many other features.

For example, the following data has two variables. The X and Y variables both have a minimum value of 0 and a maximum value of 1. Therefore the range of each variable is 1. The default graph has a 4:3 ratio of width to height, so when you create a scatter plot, the physical lengths (in pixels) of the X and Y axes are not equal:

data S;                   /* XRange = YRange = [0, 1] */
input x y @@;
datalines;
-1   -1     -0.75 -0.5   -0.5 -0.75  -0.25 0   0 0.5
0.25 -0.25   0.5   0.75   0.75 0.25   1    1
;
 
ods graphics / reset;      /* use default width and height */
title "Default Graph: 640 x 480 pixels";
title2 "Aspect Ratio 4:3";
proc sgplot data=S;
   scatter x=x y=y;
   xaxis grid; 
   yaxis grid;
run;
aspectratio1

You can click on the graph to see the original size. The graph area occupies 640 x 480 pixels. However, because of labels and titles and such, the region that contains the data (also called the wall area) is about 555 pixels wide and 388 pixels tall, which is obviously not square. You can see that each cell in the grid represents a square with side length 0.5, but the cells do not appear square on the screen because of the aspect ratio of the graph.

Setting the aspect ratio

Prior to SAS 9.4, PROC SGPLOT did not provide an easy way to set the aspect ratio of the wall area. You had to use trial and error to adjust the width of the graph until the wall area looked approximately square. For example, you could start the process by submitting ODS GRAPHICS / WIDTH=400px HEIGHT=400px;.

However, in SAS 9.4 you can use the ASPECT= option on the PROC SGPLOT statement to tell PROC SGPLOT to make the wall area (data region) square, as follows:

title "Graph: 640 x 480 pixels";
title2 "Aspect Ratio 1:1";
proc sgplot data=S aspect=1;     /* set physical dimensions of axes equal */
   scatter x=x y=y;
   xaxis grid; 
   yaxis grid;
run;
aspectratio2

Although the graph size has not changed, the wall area (which contains the data) is now square. The wall area is approximately 370 pixels in both directions.

Notice that graph has a lot of white space to the left and right of the wall area. You can adjust the width of the graph to get rid of the extra space.

This technique also works for other aspect ratios. For example, if the range of the Y variable is 2, you can use ASPECT=2 to set the wall area to be twice as high as it is wide.

The wall area is square because the range of the X variable equals the range of the Y variable, and the margins in the wall area (set by using the OFFSETMIN= and OFFSETMAX= options) are also equal. If your X and Y ranges are not exactly equal, read on.

Setting the range of the axes

In practice, the range of the X axis might not exactly equal the range of the Y axis. In that case, you can use the MIN= and MAX= options on the XAXIS and YAXIS statements to set the ranges of each variable to a common range. For example, in principal component analysis, the principal component scores are often plotted on a common scale. The following call to PROC PRINCOMP creates variables PRIN1, PRIN2, and PRIN3 that contain the principal component scores for numerical variables in the Sashelp.Iris data set:

proc princomp data=Sashelp.Iris N=3 out=OutPCA noprint;
   var SepalWidth SepalLength PetalWidth PetalLength;
run;
 
proc means data=OutPCA N min max mean std;
   var Prin:;
run;
t_aspectratio

You can see that the range of the three variables are not equal. However, you can use the ASPECT=1 option to display the scores so that one unit in the horizontal direction is the same number of centimeters as one unit in the vertical direction. The MIN= and MAX= options are used so that the ranges of the X and Y variables are equal:

ods graphics / width=480px height=480px;
title "Principal Component Scores";
title2 "Aspect Ratio 1:1";
proc sgplot data=OutPCA aspect=1;
   scatter x=Prin1 y=Prin2 / group=Species;
   xaxis grid min=-2.8 max=3.3; /* values=(-3 to 3) valueshint; */
   yaxis grid min=-2.8 max=3.3; /* values=(-3 to 3) valueshint; */
run;
aspectratio3

In spite of titles, legends, and labels, the wall area is a square. The width of the graph was reduced so that there is less blank space to the left and right of the wall area.

Notice the comments in the call to PROC SGPLOT. The comments indicate how you can explicitly set values for the axes, if necessary. You can use the VALUES= option to set the tick values. You can use the VALUESHINT option to tell PROC SGPLOT that these values are merely "hints": the tick values should not be used to extend the length of an axes beyond the range of the data.

Automating the process with GTL

I like PROC SGPLOT, but if you are running a version of SAS prior to 9.4, you can still obtain equated axes by using the GTL and PROC RENDER. The trick is to use the OVERLAYEQUATED layout, rather than the usual OVERLAY layout. The OVERLAYEQUATED layout ensures that the physical dimensions of the wall area is proportional to the aspect ratio of the data ranges. The following example uses the output from the PROC PRINCOMP analysis in the previous section:

proc template;                        /* scatter plot with equated axes */
define statgraph ScatterEquateTmplt;
dynamic _X _Y _Title;                 /* dynamic variables */
 begingraph;
 entrytitle _Title;                   /* specify title at run time (optional) */
  layout overlayequated /             /* units of x and y proportions as pixesl */
         xaxisopts=(griddisplay=on)   /* put X axis options here */
         yaxisopts=(griddisplay=on);  /* put Y axis options here */
    scatterplot x=_X y=_Y;            /* specify variables at run time */
  endlayout;
endgraph;
end;
run;
 
proc sgrender data=outPCA template=ScatterEquateTmplt; 
   dynamic _X='Prin1' _Y='Prin2' _Title="Equated Axes";
run;

The output is not shown, but is similar to the graph in the previous section. The nice thing about using the GTL is that it supports the EQUATETYPE= option, which enables you to specify how to handle axes ranges that are not equal.

In summary, there are two ways to make sure that the physical dimensions of data area (wall area) of a graph accurately represents distances in the data coordinate system. You can use the GTL and the OVERLAYEQUATED layout, as shown in this section, or you can use the ASPECT= option in PROC SGPLOT if you have SAS 9.4. Although it is not always necessary to equate the X and Y axis, SAS supports it when you need it.

Post a Comment

Extracting elements from a matrix: rows, columns, submatrices, and indices

A matrix is a convenient way to store an array of numbers. However, often you need to extract certain elements from a matrix. The SAS/IML language aupports two ways to extract elements: by using subscripts or by using indices. Use subscripts when you are extracting a rectangular portion of a matrix, such as a row, a column, or a submatrix. Use indices when you want to extract values from a non-rectangular pattern.

Extracting rows, columns, and submatrices

You can extract a submatrix by using subscripts to specify the rows and columns of a matrix. Use square brackets to specify subscripts. For example, if A is a SAS/IML matrix, the following are submatrices:

  • The expression A[2,1] is a scalar that is formed from the second row and the first column of A.
  • The expression A[2, ] specifies the second row of A. The column subscript is empty, which means “use all columns.”
  • The expression A[ , {1 3}] specifies the first and third columns of A. The row subscript is empty, which means “use all rows.”
  • The expression A[3:4, 1:2] specifies a 2 x 2 submatrix that contains the elements that are in the intersection of the third and fourth rows of A and the first and second columns of A.

The following SAS/IML statements demonstrate extracting submatrices:

proc iml;
A = shape(1:30, 5);     /* 5 x 6 matrix */
scalar = A[2,1];
row = A[2, ];           /* 2nd row */
cols = A[ , {3 1}];     /* 3rd and 1st columns */
matrix =  A[3:4, 1:2];  /* 2 x 2 matrix: (A[3,1] || A[3,2]) // 
                                         (A[4,1] || A[4,2]) */
print A, scalar, row, cols, matrix;

The previous examples were adapted from Wicklin (2013) "Getting Started with the SAS/IML Language", which I recommend for programmers who are starting to learn the SAS/IML language.

Extracting diagonals and triangular elements

Non-rectangular patterns are common in statistical programming. Examples include the matrix diagonal and the lower triangular portion of a square matrix. The SAS/IML provides special functions for extracting diagonal and triangular regions:

For example, the following statements extract the main diagonal, the lower triangular elements in row-major order, and the lower triangular elements in column-major order:

proc iml;
S = shape(1:16, 4);    /* 5 x 5 matrix */
v = vecdiag(S);
L_row = symsqr(S);
L_col = vech(S);
print S, v, L_row, L_col;

Extracting arbitrary patterns of elements

For the extraction of arbitrary elements, you should use indices. SAS/IML software stores matrices in row-major order, which means the elements are enumerated as you move across the first row, then across the second row, and so forth. However, notice that you do not know the subscripts for A[3] unless you know the shape of A. If A is a 3 x 3 matrix, A[3] corresponds to A[1,3]. However, if A is a 2 x 2 matrix, A[3] corresponds to A[2,1].

The SUB2NDX function enables you to convert subscript information into the equivalent indices. For example, suppose that B is a 5 x 5 matrix and you want to extract the following elements: B[5,2], B[2,4], B[4,3], B[3,1], and B[1,5]. The following statements convert the subscripts into indices and extract the elements:

proc iml;
B = shape(1:25, 5);   /* 5 x 5 matrix */
subscripts = {5 2,  2 4,  4 3,  3 1,  1 5};  /* five (row,col) subscripts */
ndx = sub2ndx(dimension(B), subscripts);
vals = B[ndx];
print vals;
t_extract

A powerful advantage of indices is that you can use them to assign values as well as to extract values. For example, if v is a five-element column vector, the expression B[ndx] = v assigns the values v to the elements of B. Notice that this is a vectorized operation. If you do not use indices, you would probably write a DO loop that iterates over the subscripts. In general, vector operations are more efficient than looping operations.

The ability to access elements in an arbitrary order is a big advantage for SAS/IML programmers. Whereas the DATA step processes one observation at a time, the SAS/IML language enables you to access SAS data in whatever order makes sense for the algorithm that you are writing.

Post a Comment

Determine whether a SAS product is licensed

Sometimes you are writing a program that needs to find out whether a particular SAS product (like SAS/ETS, SAS/QC, or SAS/OR) is licensed. I was reminded of this fact when I wrote last week's blog post about how to create a map with PROC SGPLOT. Although the SGPLOT procedure is included with Base SAS, the data set that I used (MAPS.US) is only available at sites that have licensed SAS/GRAPH software.

If you just want to find out what products are licensed at your site, but you don't need this information as part of a program, you can use the SETINIT procedure:

proc setinit; 
run;
Product expiration dates:
---Base SAS Software          26OCT2016
---SAS/STAT                   26OCT2016
---SAS/GRAPH                  26OCT2016
---SAS/ETS                    26OCT2016
---SAS/IML                    26OCT2016

The SAS log contains all the licensed products. However, this information is hard to use in a program, which is necessary if you are writing a program that you intend to distribute to many sites. (Perhaps you intend to post it on the SAS/IML File Exchange>)

You can use the SYSPROD function to determine at run time whether a SAS product is licensed. For example, the following SAS/IML program tests whether the SAS/GRAPH product is installed. If not, it aborts the program and prints an error message. If so, it reads the MAPS.USCity data and performs a computation:

proc iml;
if sysprod("graph")=0 then 
   ABORT "SAS/GRAPH is not licensed at this site";
else do;
   use maps.USCity;  read all var {City StateCode Pop};  close maps.USCity;
   City = strip(City);
   idx = Pop[<:>];        /* find index of city with largest population */
   print (City[idx])[L="Largest City"]
         (StateCode[idx])[L="State"]
         (Pop[idx])[L="Population" F=COMMA10.];
 
   idx = Pop[>:<];        /* find index of city with smallest population */
   print (City[idx])[L="Smallest City"]
         StateCode[idx])[L="State"]
         (Pop[idx])[L="Population" F=COMMA10.];
end;
quit;
t_sysprod

The output of the program is shown. The program uses the index maximum operator to discover that New York City has the largest population in the data set. Similarly, the index minimum operator is used to return one of the 17 "ghost towns" in the data set.

You can use the SYSPROD function in the DATA step and in SAS/IML. For SAS macro programs, you can use the %SYSPROD macro function. The documentation for %SYSPROD has an example program that tests whether SAS/GRAPH is installed.

By the way, for this example you could use the LIBREF function to check whether the MAPS libref exists. However, the SYSPROD function makes it clear that you are checking for the existing of a product, not the existence of a libref.

For a fun exercise, print out the complete set of ghost towns (WHERE POP=0) in the MAPS.USCity data set. Which states have the most ghost towns?

Post a Comment

Create a map with PROC SGPLOT

polygonmap2

Did you know that you can use the POLYGON statement in PROC SGPLOT to draw a map? The graph at the left shows the 48 contiguous states of the US, overlaid with markers that indicate the locations of major cities. The plot was created by using the POLYGON statement, which was introduced in SAS 9.4m1.

The POLYGON statement is a versatile and powerful statement. My colleague Sanjay Matange has shown how to use the POLYGON statement to create custom plots such as a variable-width bar chart. He has also shown how to use the POLYGONPLOT statement in the Graph Template Language (GTL) to create panels of maps, and he mentioned that the POLYGON statement in PROC SGPLOT acts similarly.

Identifying polygons

You need to look carefully at Sanjay's program to see how he constructs an ID variable that identifies each polygon. For map data, it is common for regions to be defined in a nested hierarchical manner. For example, the MAPS.US data set that is distributed with SAS/GRAPH software has two variables (STATE and SEGMENT) that together define the polygons that make up the map of US states. (For example, Michigan has two segments that correspond to Lower Peninsula and the Upper Peninsula.) Sanjay uses a standard trick that concatenate multiple variable values into a single ID variable, and uses the concatenated values to identify each polygon.

Because the MAPS.US data set is sorted by the STATE and SEGMENT, you can generate an ID variable without using concatenation. Instead, you can use BY-group processing to generate an ID variable for each polygonal segment. The following DATA step defines the PolyID variable and increments its value every time a new segment is encountered. The PolyID variable is specified on the ID= variable in the POLYGON statement in order to create a graph of the regions in the data set:

data US;
   set maps.US;
   where StateCode NOT IN ("AK", "HI", "PR");
   by State Segment;
   if first.Segment then PolyID+1;    /* create ID variable for polygons */
run;
 
ods graphics / antialiasmax=1500 labelmax=1500; /* enable anti-aliasing and labels */
title "Outline of 48 US States";
proc sgplot data=US;
   polygon x=x y=y ID=PolyID / fill outline;
run;
polygonmap1

Notice that the MAPS.US data set uses projected radians as the coordinates for the state boundaries. You can multiply these values by 180/π to convert to degrees. You can use the GPROJECT procedure in SAS/GRAPH software to project data sets that provide longitude and latitude variables.

Overlays on polygon plots

You can overlay data on a polygon plot. In particular, you can use the BUBBLE, TEXT, SCATTER, and VECTOR statements, as well as additional POLYGON statements to overlay additional spatial information on a map. The following DATA step reads the coordinates of a few large cities from the MAPS.USCITY data set and concatenates that information with the state-boundary data. The SCATTER statement overlays the cities and labels them.

data All; 
   set maps.uscity(where=(statecode NOT IN ("AK" "HI" "PR") & pop > 5e5)
                   rename=(x=cityX y=cityY))
       US;
run;
 
title "US States with Some Large Cities";
proc sgplot data=All noautolegend;
   polygon x=x y=y ID=PolyID / fill outline transparency=0.75 group=StateCode;
   scatter x=cityX y=cityY / datalabel=city;
   xaxis display=none;
   yaxis display=none;
run;

The graph is shown at the top of this article. Notice that the GROUP= option in the POLYGON statement is used to color each state.

With additional work, you can use the GROUP= option to create a choropleth map. You can use the DATTRMAP= option on the PROC SGPLOT statement to specify colors for the various group values. Alternatively, you can use the STYLEATTRS statement to assign colors for groups.

The SGPLOT procedure does not replicate the functionality of the SAS/GRAPH software. The mapping procedures in SAS/GRAPH software (which includes GREDUCE, GPROJECT, GMAP, and MAPIMPORT) enable you to manipulate and visualize map data in more complex ways. However, the POLYGON statement in PROC SGPLOT enables you to create some simple maps that visualize countries, states, and counties, and to overlay graphical elements by using other SGPLOT statements.

Post a Comment

Label markers in graphs by using the values of several variables

In many procedures, the ID statement is used to identify observations by specifying an identifying variable, such as a name or a patient ID. In many regression procedures, you can specify multiple ID variables, and all variables are copied into output data sets that contain observation-wise statistics such as predicted values and residuals.

Values of an ID variable can also be used to label markers on a graph. For example, the following call to PROC REG models the weights of 19 students as a linear function of their heights. The PLOTS=COOKSD(LABEL) option requests a plot of Cook's D statistic, which gives information about observations that appear to have a large influence on the parameter estimates. The LABEL option requests that extreme values of the Cook's D statistic be labeled by using the ID variable. Let's see what happens if you specify two variables on the ID statement:

proc reg data=sashelp.class plots(only)=cooksd(label);
model weight = height;
id name age;             /* specify TWO ID variables */
ods select CooksDPlot;
quit;
concatID1

I've added a red rectangle that surrounds the labeled marker. Although two variables (NAME and AGE) were specified in the ID statement, the graph clearly shows that only the first variable (NAME) was used as a label. The documentation for the PLOTS= option tells us that this is the expected behavior: "If you specify one or more ID variables..., then the first ID variable [that] you specify is used for the labeling."

Most procedures exhibit a similar behavior, although a few (LOGISTIC, TRANSREG, ...) provide tooltips or other support for multiple ID variables.

For the procedures that support labeling by only one variable, there is an easy workaround: create a new ID variable by concatenating several other variables.

Concatenating multiple ID variables

You can use the concatenation operator (||) in the DATA step to concatenate two variables together. The following statements concatenate values of the NAME and AGE variables. The TRIM function can be used to remove trailing blanks from the NAME variable. In many cases, the DATA step will automatically convert the numeric AGE variable to a character value, but the following statements use the PUT function to explicitly convert a numeric variable to a character value prior to concatenation:

/* concatenate NAME and AGE to make new ID variable */
data class;
set sashelp.class;
length labl $12;    /* NAME holds 8 characters. Allow for delimiter and age */
labl = trim(Name) || ": " || put(age, 2.);  /* convert two-digit age to character */
run;
 
proc reg data=class plots(only)=cooksd(label);
model weight = height;
id labl;               /* label by using new variable */
ods select CooksDPlot;
quit;
concatID2

With this change, each extreme point in the graph is labeled with the values of the NAME and AGE variables.

Labeling markers in ODS graphics

You can use the same trick to label a marker in a graph created by PROC SGPLOT. The canonical example is a scatter plot in which you want to label all observations. Because the DATALABEL= option supports only a single variable, you have to use concatenation if you want to see the values of multiple variables:

proc sgplot data=class;
scatter x=height y=weight / datalabel = labl;
run;
concatID3

A drawback of the "concatenation technique" is that most graphs do not have room to display long labels for many points. The DATALABEL= option uses a collision-avoidance algorithm to avoid overplotting labels, but if you have hundreds or thousands of markers, it is best to use another mechanism for labeling markers.

One option to display multiple labels is to label only certain observations in the graph, such as observations that have extreme values. Another option, which works if you are going to examine the graph on your computer, is to use the TIP= option to create a tooltip for the scatter plot. With a tooltip, you can hover a pointer over a marker and view a pop-up window that displays the values of multiple variables.

Post a Comment