Sometimes you need to reverse the data before you fit a distribution

Occasionally on a discussion forum, a statistical programmer will ask a question like the following:

I am trying to fit a parametric distribution to my data. The sample has a long tail, so I have tried the lognormal, Weibull, and gamma distributions, but nothing seems to fit. Please help!!

In general, there is no reason to expect a particular distribution to fit arbitrary data. However, sometimes the person asking the question has a theoretical reason why the model should fit, such as the data are supposed to be "lifetime" data that are part of a reliability or survival analysis.

For several situations that I can remember, the problem occurred because the data distribution was skewed to the left, whereas by convention the usual "named" distributions have positive skewness. The following sample of 50 data values provides an example:

data Sample;
input X @@;
81 91 90 91 87 56 93 80 80 89 93 87 86 58 81 90 82 71 85 94 
86 79 82 89 87 87 96 76 77 91 87 67 93 84 90 88 78 92 87 86 
61 82 83 92 81 83 87 91 84 72 
proc univariate data=Sample;
   histogram X / endpoints=55 to 100 by 5
         odstitle="Distribution of Sample Data";
Sample data that has negative skewness

The descriptive statistics from PROC UNIVARIATE (not shown) indicate that the sample skewness is about -1.5. The histogram confirms that the data distribution has negative skewness. Consequently, the lognormal, Weibull, and gamma distributions will not fit these data well.

A transformation that reverses the data distribution

You can transform the data so that the skewness is positive and the long tail is to the right. To do this correctly requires domain-specific knowledge, but the general idea is to apply a linear transformation of the form Y = cb X for some constants c and b. If you don't want to change the scale of the data, use b = 1.

For example, suppose that the data are the results of an assessment procedure that assigns a value between 0 (bad) and 100 (good) to each item on an assembly line. An alternative way to score each item is to record the number of points that are deducted by the assessment procedure. For the alternative scoring system, low scores are good and high scores are bad. The conversion between the scoring systems is simply Y = 100 – X. The following DATA step creates the new scores and overlays several parametric models that fit the new transformed data:

data Transform;
set Sample;
Y = 100 - X;
proc univariate data=Transform;
   var Y;
   histogram / endpoints=0 to 45 by 5
         odstitle="Distribution of Reversed Data"
         lognormal(threshold=0)  Weibull(threshold=0)  gamma(threshold=0); 
Transformed data has positive skewness

The transformed data has positive skewness. I used knowledge of the data measurements to choose reasonable values for the linear transformation that flips the data distribution. If you know nothing about the data, you could choose c to be any value greater than the maximum data value for X. That guarantees that the transformed data could be modeled by a distribution that has zero for the threshold parameter. Try to choose a transformation for which the new measurements are easy to understand; different values of c will lead to different estimates for the parameters.

In summary, many standard modeling distributions (exponential, lognormal, gamma, Weibull, ...) assume that the data are positively skewed. If your data has negative skewness, try to use a linear transformation to reverse the data before you model it.

Post a Comment

Counting observations for which two events occur

Every year near Halloween I write an article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick (which features the CMISS function and the crossproducts matrix in SAS/IML) enables you to count the number of observations that are missing for pairs of columns. In other words, this trick counts how many missing values are in columns 1 and 2, columns 1 and 3, and so on for all "k choose 2" pairwise combinations of k columns.

This trick generalizes, so keep reading even if you don't care about counting missing values. Given ANY indicator matrix, you can use this trick to count the pairwise occurrence of events.

Counting pairwise events: A connection to matrix multiplication #Statistics Click To Tweet

Counting missing values

Of course, you can count combinations of missing values without using SAS/IML or matrices. I've previously shown how to use PROC MI to count all combinations of missing values among a set of k variables. However, as I said, this SAS/IML trick enables you to count more than just missing values: it generalizes to count ANY pairwise event.

To get started, read data into a SAS/IML matrix. The following statements read 5209 observations and eight variables into the matrix X:

proc iml;
varNames = {"AgeAtStart" "Diastolic" "Systolic" 
            "Height" "Weight" "MRW" "Smoking" "Cholesterol"};
use Sashelp.Heart;               /* open data set */
read all var varNames into X;    /* create numeric data matrix, X */
close Sashelp.Heart;

The first part of the trick is to convert the matrix into a 0/1 indicator matrix, where 1 indicates that a data value is missing and 0 indicates nonmissing. I like to use the CMISS function in Base SAS for this task.

The second part of the trick is to recognize that if C is an indicator matrix, the crossproducts matrix P=C`C is a matrix that gives the counts of the pairwise events in C. The element P[i,j] is the inner product of the i_th and j_th columns of C, and because C is an indicator matrix the inner product counts the number of simultaneous "events" for column i and column j.

In the SAS/IML language, the code to compute pairwise counts of missing values requires only two lines:

C = cmiss(X);                    /* create 0/1 indicator matrix */
P = C`*C;                        /* pairwise counts */
print P[c=varNames r=varNames label="Pairwise Missing Counts"];

The P matrix is symmetric. The diagonal elements P[i,i] show the number of missing values for the i_th variable. For example, the Height, Weight, and MRW variables have 6 missing values whereas the Cholesterol variable has 152 missing values. The off-diagonal elements P[i,j] show the number of observations that are simultaneously missing for the i_th and j_th variables. For example, 28 observations have missing values for both the Smoking and Cholesterol variables; two observations have missing values for both the Height and Weight variables.

Counting any pair of events

If C is any indicator matrix, the crossproducts matrix P=C`C counts the number of observations for which two columns of C are simultaneously 1.

For example, the following statements standardize the data to create z-scores. You can form an indicator matrix that has the value 1 if a z-score is exceeds 3 in absolute value; these are outliers for normally distributed data. The crossproducts matrix counts the number of observations that are univariate outliers (on the diagonal) and the number that are outliers for a pair of variables:

Z = (X - mean(X)) / std(X);      /* standardize data */
L = (abs(Z) > 3);                /* L indicates outliers */
Q = L`*L;                        /* counts for pairs of outliers */
print Q[c=varNames r=varNames label="Pairwise Outliers"];

The table indicates that 52 patients have a diastolic blood pressure (BP) that is more than three standard deviations above the mean. Similarly, 83 patients are outliers for systolic BP, and 35 patients are outliers for both measurements.

The following graph visualizes the patients that have high blood pressure in each category:

Outlier = j(nrow(L), 1, "No         ");       /* create grouping variable */
Outlier[ loc(L[,2] | L[,3]) ] = "Univariate";
Outlier[ loc(L[,2] & L[,3]) ] = "Bivariate";
title "Standardized Blood Pressure";
title2 "Outliers More than Three Standard Deviations from Mean";
call scatter(Z[,2], Z[,3]) group=Outlier 
         label={"Standardized Diastolic" "Standardized Systolic"}
         option="transparency=0.5 markerattrs=(symbol=CircleFilled)"
         other="refline 3/axis=x; refline 3/axis=y;";

In conclusion, you can create an indicator matrix for any set of conditions. By forming the crossproducts matrix, you get the counts of all observations that satisfy each condition individually and in pairs. I think this trick is a real treat!

Post a Comment

Create patterns of missing data

When simulating data or testing algorithms, it is useful to be able to generate patterns of missing data. This article shows how to generate random and systematic patterns of missing values. In other words, this article shows how to replace nonmissing data with missing data.

Create patterns of missing data in #SAS Click To Tweet

Generate a random pattern of missing values

The following SAS/IML program reads numerical data into a matrix from the Sashelp.Class data set. The matrix has 16 rows and three columns. The program then generates a matrix of the same size that contains a random pattern of zeros and ones, where about 40% of the values will be ones. The LOC function is used to find the locations of the ones, and the corresponding locations in the data are set to missing:

proc iml;
use Sashelp.Class;                    /* read numeric data into X */
read all var _NUM_ into X;
/* random assignment of missing values */
RandX = X;                                          /* copy data */
p = 0.4;                /* approx proportion of missing elements */
call randseed(1234);
B = randfun(dimension(X), "bern", p);         /* random 0s or 1s */
missIdx = loc(B=1);                       /* find position of 1s */
if ncol(missIdx)>0 then
   RandX[missIdx] = .;                /* replace 1s with missing */
print RandX;
 A ran dom pattern of missing values

In this way, you can replace a certain percentage of the data values with missing values.

Generate a systematic pattern of missing values

In the preceding section, the technique for inserting missing values does not use the fact that the matrix B is random. The technique works with any zero-one matrix B that specifies a pattern of missing values. For example, you can create a matrix that contains all combinations of zeros and ones, then use that pattern to set missing values, as follows:

C = { 0 0 0, 
      0 0 1, 
      0 1 0, 
      0 1 1, 
      1 0 0, 
      1 0 1, 
      1 1 0, 
      1 1 1 };                              /* pattern matrix */
missIdx = loc(C=1);     
SysX = X;                                        /* copy data */
if ncol(missIdx)>0 then
   SysX[missIdx] = .;              /* replace 1s with missing */
print SysX;
Systematic pattern of m issing values

You could also specify the locations of the missing values by using subscripts of the data matrix. You can use the SUB2NDX function to convert subscripts to indices.

Patterns of missing data by using the SAS DATA step

In the SAS DATA step you can use arrays to create a random pattern of missing values. For example, the following SAS data set reads numerical variables from the Sashelp.Class data and randomly assigns 40% of the data to missing values:

/* generate missing values in random locations */
data RandClass(drop=i);
call streaminit(1234);
set Sashelp.Class(keep=_NUMERIC_);
array x {*} _numeric_;
do i = 1 to dim(x);
   if rand("Bern", 0.4) then         /* p=0.4 ==> about 40% missing */
proc print; run;

The output is not shown, but the random pattern is identical to the random pattern that was generated by using SAS/IML matrices.

You could use the DATA step to specify patterns of missing values for which there is a formula, such as every fourth data value (MOD(cnt,4)=1). However, it is less easy to generate an arbitrary pattern, such as the "all combinations" pattern in the previous section. In general, I think the SAS/IML approach is easier to use and more flexible.

For any pattern of missing values, you can use PROC MI to summarize the pattern. You can also use various graphical techniques to visualize the pattern of missing data.

Post a Comment

Ahh, that's smooth! Anti-aliasing in SAS statistical graphics

I've written several articles about scatter plot smoothers: nonparametric regression curves that reveal small- and large-scale features of a response variable as a function of an explanatory variable. However, there is another kind of "smoothness" that you might care about, and that is the apparent smoothness of curves and markers that are rendered on your computer screen or other device.

Recall that anti-aliasing is a graphical technique in which some pixels along a rendered curve are set to an intermediate color, which makes a curve look smoother. For example, if a curve is being drawn by using a black pen, some of the neighboring pixels along the rendered curve are set to shades of grey, which tricks the eye into seeing a smooth curve instead of a jagged, pixellated curve.


Anti-aliasing in ODS statistical graphics

The ANTIALIAS= and ANTIALIASMAX= options were added to the ODS GRAPHICS statement in SAS 9.2. A typical usage follows:

ods graphics / antialias=on antialiasmax=5000;

The ANTIALIAS= option specifies whether to anti-alias. By default, anti-aliasing is on. Because it can be expensive to anti-alias many thousands of graphical elements, the ANTIALIASMAX= option enables you to specify the maximum number of elements (markers or curve points) that can be in a plot before anti-aliasing is disabled for that plot. The default value is ANTIALIASMAX=4000 for SAS 9.4m3. However, the default is only 600 for earlier releases, so you might want to bump up that value when you need a presentation-quality graphic that has thousands of graphical elements. If SAS disables anti-aliasing for a plot because the plot contains too many elements, the SAS log will contain a note similar to the following:

NOTE: Marker and line anti-aliasing has been disabled because
 the threshold has been reached. You can set
 ANTIALIASMAX=1000 in the ODS GRAPHICS statement to restore

Subpixel rendering in ODS graphics

A related option is turning on subpixel rendering by using the SUBPIXEL option. The SUBPIXEL option was added to the ODS GRAPHICS statement in SAS 9.4m3, but it has been available on the PROC SGPLOT statement for several 9.4 releases.

The SAS documentation for the ODS GRAPHICS statement says that the SUBPIXEL option "produces smoother curves and more precise bar spacing." There is a section in the documentation titled "Subpixel Rendering," which demonstrates the impact that subpixel rendering can have on curves and bar charts.

The documentation says that subpixel rendering "is enabled by default for image output, unless the graph contains a scatter plot or a scatter-plot matrix. In those cases, subpixel rendering is disabled by default."

For me, subpixel rendering solves a problem that I've experienced when I create a large bar chart with many categories. The number of bars, the width of the bars, and the dimensions of the graph determine whether the number of pixels between bars is uniform or whether some gaps are larger than others. Sometimes you will see small uneven gaps between the bars, as shown on the left side of the following plot. However, subpixel rendering improves the plot tremendously, as shown on the right side:


In SAS 9.4m3 and beyond, the SUBPIXEL option applies to all plot types. Prior to SAS 9.4m3, the option applied only to line charts and bar charts; see the documentation of the PROC SGPLOT statement for the specific plots that were supported.

Try it yourself

I think the best way to learn about anti-aliasing and subpixel rendering is to try it out yourself! These ODS options apply to all ODS statistical graphics, including those that are created by SAS analytical procedures. Remember, however, that the option was only available in the PROC SGPLOT statement for 9.4 releases prior to m3.

The following SAS statements enable you to play with the options and see the differences for a simple loess curve overlaid on a scatter plot:

ods graphics / reset ANTIALIAS=off;       /* anti-aliasing off */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=off; /* anti-aliasing on */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=on;  /* SAS 9.4m3 */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;

Further reading

The following resources provide further information about anti-aliasing and subpixel rendering in ODS graphics:

Post a Comment

Loess regression in SAS/IML

A previous post discusses how the loess regression algorithm is implemented in SAS. The LOESS procedure in SAS/STAT software provides the data analyst with options to control the loess algorithm and fit nonparametric smoothing curves through points in a scatter plot.

Although PROC LOESS satisfies 99.99% of SAS users who want to fit a loess model, some research statisticians might want to extend or modify the standard loess algorithm. Researchers like to ask "what if" questions like "what if I used a different weighting function?" or "what if I change the points at which the loess model is evaluated?" Although the loess algorithm is complicated, it is not hard to implement a basic version in a matrix language like SAS/IML.

Implement a basic version of loess regression in SAS/IML. #SAStip Click To Tweet

Implement loess regression in SAS/IML

Recent blog posts have provided some computational modules that you can use to implement loess regression. For example, the PairwiseNearestNbr module finds the k nearest neighbors to a set of evaluation points. The functions for weighted polynomial regression computes the loess fit at a particular point.

You can download a SAS/IML program that defines the nearest-neighbor and weighted-regression modules. The following call to PROC IML loads the modules and defines a function that fits a loess curve at points in a vector, t. Each fit uses a local neighborhood that contains k data values. The local weighted regression is a degree-d polynomial:

proc iml;
load  module=(PairwiseNearestNbr PolyRegEst PolyRegScore);
/* 1-D loess algorithm. Does not handle missing values.
   Input: t: points at which to fit loess (column vector)
          x, y: nonmissing data (column vectors)
          k: number of nearest neighbors used for loess fit
          d: degree of local regression model 
   Output: column vector L[i] = f(t[i]; k, d) where f is loess model
start LoessFit(t, x, y, k, d=1);
   m = nrow(t);
   Fit = j(m, 1);                            /* Fit[i] is predicted value at t[i] */
   do i = 1 to m;
      x0 = t[i];
      run PairwiseNearestNbr(idx, dist, x0, x, k);
      XNbrs = X[idx];  YNbrs = Y[idx];       /* X and Y values of k nearest nbrs */
      /* local weight function where dist[,k] is max dist in neighborhood */
      w = 32/5 * (1 - (dist / dist[k])##3 )##3; /* use tricubic weight function */
      b = PolyRegEst(YNbrs, XNbrs, w`, d);   /* param est for local weighted regression */
      Fit[i] = PolyRegScore(x0, b);          /* evaluate polynomial at x0 */
   return Fit;

This algorithm provides some features that are not in PROC LOESS. You can use this function to evaluate a loess fit at arbitrary X values, whereas PROC LOESS evaluates the function only at quantiles of the data. You can use this function to fit a local polynomial regression of any degree (for example, a zero-degree polynomial), whereas PROC LOESS fits only first- and second-degree polynomials. Although I hard-coded the standard tricubic weight function, you could replace the function with any other weight function.

On the other hand, PROC LOESS supports many features that are not in this proof-of-concept function, such as automatic selection of the smoothing parameter, handling missing values, and support for higher-dimensional loess fits.

Call the loess function on example data

Let's use polynomials of degree 0, 1, and 2 to compute three different loess fits. The LoessData data set is defined in my previous article:

use LoessData;  read all var {x y};  close;   /* read example data */
s = 0.383;                /* specify smoothing parameter */
k = floor(nrow(x)*0.383); /* num points in local neighborhood */
/* grid of points to evaluate loess curve (column vector) */
t    = T( do(min(x), max(x), (max(x)-min(x))/50) );
Fit0 = LoessFit(t, x, y, k, 0);    /* loess fit with degree=0 polynomials */
Fit1 = LoessFit(t, x, y, k, 1);    /* degree=1 */
Fit2 = LoessFit(t, x, y, k, 2);    /* degree=2 */
create Sim var {x y t Fit0 Fit1 Fit2};  append;  close;

You can use PROC SGPLOT to overlay these loess curves on a scatter plot of the data:

title "Overlay loess curves computed in SAS/IML";
proc sgplot data=Sim;
label Fit0="Loess (Deg=0)" Fit1="Loess (Deg=1)" Fit2="Loess (Deg=2)"; 
scatter x=x y=y;
series x=t  y=Fit0 / curvelabel;
series x=t  y=Fit1 / curvelabel lineattrs=(color=red);
series x=t  y=Fit2 / curvelabel lineattrs=(color=ForestGreen);
xaxis grid; yaxis grid;
Loess curves computed in SAS/IML

The three curves are fairly close to each other on the interior of the data. The degree 2 curve wiggles more than the other two curves because it uses a higher-degree polynomial. The over- and undershooting becomes even more pronounced if you use cubic or quartic polynomials for the local weighted regressions.

The curious reader might wonder how these curves compare to curves that are created by PROC LOESS or by the LOESS statement in PROC SGPLOT. In the attached program I show that the IML implementation produces the same predicted values as PROC LOESS when you evaluate the models at the same set of points.

Most SAS data analysts are happy to use PROC LOESS. They don't need to write their own loess algorithm in PROC IML. However, this article shows that IML provides the computational tools and matrix computations that you need to implement sophisticated algorithms, should the need ever arise.

Post a Comment

What is loess regression?

Scatter Plot with Loess Smoother

Loess regression is a nonparametric technique that uses local weighted regression to fit a smooth curve through points in a scatter plot. Loess curves are can reveal trends and cycles in data that might be difficult to model with a parametric curve. Loess regression is one of several algorithms in SAS that can automatically choose a smoothing parameter that best fits the data.

In SAS, there are two ways to generate a loess curve. When you want to see statistical details for the fit, use the LOESS procedure. If you just want to overlay a smooth curve on a scatter plot, you can use the LOESS statement in PROC SGPLOT.

This article discusses the 1-D loess algorithm and shows how to control features of the loess regression by using PROC LOESS and PROC SGPLOT. You can also use PROC LOESS to fit higher dimensional data; the PROC LOESS documentation shows an example of 2-D loess, which fits a response surface as a function of two explanatory variables.

Overview of the loess regression algorithm

The loess algorithm, which was developed by Bill Cleveland and his colleagues in the late '70s through the 'early 90s, has had several different incarnations. Assume that you are fitting the loess model at a point x0, which is not necessarily one of the data values. The following list describes the main steps in the loess algorithm as implemented in SAS:

  1. Choose a smoothing parameter: The smoothing parameter, s, is a value in (0,1] that represents the proportion of observations to use for local regression. If there are n observations, then the k = floor(n*s) points closest to x0 (in the X direction) form a local neighborhood near x0.
  2. Find the k nearest neighbors to x0: I recently showed a SAS/IML module that can compute nearest neighbors.
  3. Assign weights to the nearest neighbors: The loess algorithm uses a tricubic weight function to weight each point in the local neighborhood of x0. The weight for the i_th point in the neighborhood is
    wi = (32/5) (1- (di / D)3 )3
    where D is the largest distance in the neighborhood and di is the distance to the i_th point. (The weight function is zero outside of the local neighborhood.) The graph of the weight function is shown below: Tricubic weight function for loess regression
    The weight function gives more weight to observations whose X value is close to x0 and less weight to observations that are farther away.
  4. Perform local weighted regression: The points in the local neighborhood of x0 are used to fit and score a local weighted regression model at x0.

These four steps implement the basic loess method. The SAS procedures add a fifth step: optimize the smoothing parameter by fitting multiple loess models. You can use a criterion such as the AICC or GCV to balance the tradeoff between a tight fit and a complex model. For details, see the documentation for selecting the smoothing parameter.

How to score a loess regression model

The previous section told you how to fit a loess model at a particular point x0. PROC LOESS provides two choices for the locations at which you can evaluate the model:

  • By default, PROC LOESS evaluates the model at a data-dependent set of points, V, which are vertices of a k-d tree. Think of the points of V as a grid of X values. However, the grid is not linearly spaced in X, but is approximately linear in the quantiles of the data.
  • You can evaluate the model at each unique X data value by using the DIRECT option on the MODEL statement.

If you want to score the model on a set of new observations, you cannot use the direct method. When you score new observations by using the SCORE statement, PROC LOESS uses linear or cubic interpolation between the points of V and the new observations. You can specify the interpolation scheme by using the INTERP= option on the MODEL statement.

Comparing PROC LOESS and the LOESS statement in PROC SGPLOT

The MODEL statement for the LOESS procedure provides many options for controlling the loess regression model. The LOESS statement in PROC SGPLOT provides only a few frequently used options. In some instances, PROC SGPLOT uses different default values, so it is worthwhile to compare the two statements.

  • Choose a smoothing parameter: In both procedures, you can choose a smoothing parameter by using the SMOOTH= option.
  • Fit the local weighted regression: In both procedures, you can control the degree of the local weighted polynomial regression by using the DEGREE= option. You can choose a linear or a quadratic regression model. Both procedures use the tricubic function to determine weights in the local neighborhood.
  • Choose an optimal smoothing parameter: PROC LOESS provides the SELECT= option for controlling the selection of the optimal smoothing parameter. PROC SGPLOT does not provide a choice: it always optimizes the AICC criterion with the PRESEARCH suboption.
  • Evaluate the fit: Both procedures evaluate the fit at a set of data-dependent values, then uses interpolation to evaluate the fit at other locations.
    • In PROC LOESS, you can use the SCORE statement to interpolate at an arbitrary set of points. You use the INTERP= option in the MODEL statement to specify whether to use linear or cubic interpolation.
    • In PROC SGPLOT, the interpolation is performed on a uniform grid of points. The default grid contains 201 points between min(x) and max(x), but you can use the MAXPOINTS= option to change that number. You use the INTERPOLATION= option to specify linear or cubic interpolation.

A loess example in SAS

The following SAS DATA step creates 30 observations for X and Y variables. The call to PROC LOESS creates a loess curve to the data and creates a fit plot, a residual plot, and a panel of diagnostic plots. Only the fit plot is shown:

data LoessData;
input x y @@;
11.7  2.3  19.9  8.1  11.8  4.6  17.1  5.1  16.5  4.8
 5.6  1.7  12.9  5.4   7.6  3.0   9.0  4.8  17.5  5.0
10.4  1.3  16.9  2.8   5.6  1.8  18.7  6.9   3.7  1.7
 7.4  2.3   2.0  2.7  14.8  5.2   3.0  0.0  16.8  4.2
15.0  6.6  19.9  5.5   1.9  1.1  14.8  5.8  12.4  3.0
14.0  6.4  11.7  3.6   8.2  2.9  18.8  6.2   0.3  1.8
ods graphics on;
ods select FitPlot;
proc loess data=LoessData plots=FitPlot;
model y = x / interp=linear           /* LINEAR or CUBIC */
              degree=1                /* 1 or 2 */
              select=AICC(presearch); /* or SMOOTH=0.383 */

For the PROC LOESS call, all options are the default values except for the PRESEARCH suboption in the SELECT= option. You can create the same fit plot by using the LOESS statement in PROC SGPLOT. The default interpolation scheme in PROC SGPLOT is cubic, so the following statements override that default option:

title "PROC SGPLOT with LOESS Statement";
proc sgplot data=LoessData  noautolegend;
loess x=x y=y / interpolation=linear  /* CUBIC or LINEAR */
                degree=1              /* 1 or 2 */
                ;  /* default selection or specify SMOOTH=0.383 */
xaxis grid; yaxis grid;
Comparison of loess regression curves in SAS: PROC LOESS versus PROC SGPLOT

The two plots are shown side by side. The one on the left was created by PROC LOESS. The one on the right was created by PROC SGPLOT.

In conclusion, SAS provides two ways to overlay a smooth loess curve on a scatter plot. You can use PROC LOESS when you want to see the details of statistical aspects of the fit and the process that optimizes the smoothing parameter. You can use the SGPLOT procedure when you care less about the details, but simply want an easy way to show a nonlinear relationship between a response and an explanatory variable.

Post a Comment

The empty-space distance plot

How far away is the nearest hospital? How far is the nearest restaurant? The nearest gas station? These are commonly asked questions whose answers depend on the location of the person asking the question.

Recently I showed an algorithm that enables you to find the distance between a set of locations and a fixed set of reference sites. The set of locations (the people) can be arbitrary, so a good way to summarize the problem is to draw a contour plot that shows the distance from every point in a region to the nearest reference site (a hospital or store). In SAS, you can create this graph by using the SPP procedure, which analyzes spatial point processes. The graph is called an empty-space plot.

Example: The location of large US cities

Some people love big cities and the diverse opportunities that they provide. They want to know how far it is to the nearest big city. Other people hate the noise and congestion; they want to be as far away from a big city as possible. To demonstrate how to create an empty-space plot, let's look at the spatial distribution of US cities that have population more than 200,000.

The following SAS DATA step reads data from the Sashelp.USCity data set, which is distributed as part of SAS/GRAPH software. The BigCities data contains only cities in the contiguous US for which the population is greater than 200,000. The call to PROC SGPLOT creates a scatter plot that shows the projected longitude and latitude (measures in radians) of these 86 cities.
data BigCities;
set maps.uscity;                          /* part of SAS/GRAPH */
where statecode not in ("AK" "HI" "PR") & pop > 200000;
if pop>500000 then Name=City;             /* label really big cities */
else Name = "        ";
title  "Large US Cities";
title2 "Population > 200,000";
proc sgplot data=BigCities;
  scatter x=x y=y / markerattrs=(symbol=CircleFilled) datalabel=Name;
Locations of US cities with population greater than 200,000

The scatter plot supports a well-known fact: cities are not settled in random locations. People tend to settle near beneficial geographic features, such as proximity to rivers, harbors, and other strategic resources. Therefore the cities are not spread uniformly throughout the country. There are clusters of cities on both coasts and there is a large empty space in the rugged northwest. The large empty space contain locations that are much farther away from a big city than would be expected for a random uniform assortment of 86 cities.

The following call to PROC SPP create an empty-space plot, which reveals regions that are far from a big city. The F option will be explained in a moment:

proc spp data=BigCities plots(equate)=(emptyspace(obs=on) F);
   process BigCities = (x, y / area=(-0.37,-0.19,0.32, 0.24)) / F;
Empty-space plot for large US cities, created in SAS

The dark blue color indicates geographic coordinates that are far from any large city. The darkest blue appears to be in the northwestern states, such as Montana, Idaho, and Wyoming. Western Texas also stands out. Those are great places to live for people who want to be far from large cities. There is also dark blue in regions that are uninhabitable (oceans and the Great Lake) or are not part of the US (Mexico and Canada).

I think the image is beautiful, but it is useful as well. It gives a visual summary of locations that are far from the reference sites (big cities).

Empty-space distances

F function for average spatial ditances, computed with PROC SPP in SAS

If you want a more statistical analysis, the F option in the PROCESS statement and in the PLOTS= option tells PROC SPP to lay down a regular grid of points and compute the distance from each grid position to the nearest big city. (By default the grid is 50 x 50.) The graph shows the cumulative distribution of these distances in blue. You can compare the distribution of these distances to the expected distribution for randomly selected locations, which is shown in red and labeled "complete spatial randomness."

You can see that the empirical distribution rises more slowly than the theoretical curve. This indicates that the observed locations are more clustered than you would expect if the cities were randomly located. For details about the F option, see the documentation for PROC SPP.

The main purpose of this example is to demonstrate that PROC SPP makes it easy to compute the empty-space plot, which summarizes the distance between an arbitrary location and a pattern of points. In this article the points were large US cities, but obviously you could analyze other point patterns. The SPP procedure has many options that indicate whether the pattern appear to be randomly distributed or clustered. The procedure indicates that the locations of large US cities are not random.

Post a Comment

WHERE operators in SAS: Multiple comparisons and fuzzy matching

The WHERE clause in SAS is a powerful mechanism for selecting observations as you read or write a data set. The WHERE clause supports many operators, including the IN operator, which enables you to compactly specify multiple conditions for a categorical variable.

A common use of the IN operator is to specify a list of US states and territories that should be included or excluded in an analysis. For example, the following DATA step reads the Sashelp.Zipcode data, but excludes zip codes for the states of Alaska ("AK"), Hawaii ("HI"), and US territories such as Puerto Rico ("PR"), Guam ("GU"), and so on:

data Lower48;
set Sashelp.Zipcode(where=(   /* exclude multiple states and territories */
    Statecode not in ("AK" "HI" "VI" "GU" "FM" "MP" "MH" "PW"))

WHERE operators in SAS/IML are vectorized

In my previous article about how to use the WHERE clause in SAS/IML, my examples used scalar comparisons such as where(sex="F") to select only females in the data. The SAS/IML language does not support the IN operator, but there is another compact way to include or exclude multiple values. Because SAS/IML is a matrix-vector language, many operations support vector arguments. In particular, the WHERE clause in the SAS/IML language enables you to use the ordinary equal operator (=) and specify a vector of values on the right hand side!

For example, the following statement reads in all US zip codes in the contiguous US and creates a scatter plot of their locations:

proc iml;
excludeList = {"AK" "HI" "PR" "VI" "GU" "FM" "MP" "MH" "PW"};
use Sashelp.Zipcode where(Statecode ^= excludeList);   /* vector equiv of NOT IN */
read all var {X Y "City" "Statecode"};
title "Centers of US ZIP Codes";
call scatter(X, Y) group=Statecode option="markerattrs=(size=2)" 
     label={"Longitude" "Latitude"} procopt="noautolegend";
ZIP code locations filtered by a WHERE clause in SAS/IML

The WHERE clause skips observations for which the Statecode variable matches any of the values in the excludeList vector. The scatter plot reveals the basic shape of the contiguous US. You can see that the plot does not display locations from Alaska, Hawaii, or US territories.

String matching operators in the SAS WHERE clause

As long as we're talking about the WHERE clause in SAS, let's discuss some string-matching operators that might not be familiar to some SAS programmers. I'll use SAS/IML for the examples, but these operators are generally supported (in scalar form) in all SAS WHERE clauses. The operators are

  • The "contains" operator (?)
  • The "not contains" operator (^?)
  • The "begins with" operator (=:)
  • The "sounds like" operator (=*)

All these operators are documented in the list of WHERE clause operators in SAS/IML.

WHERE operators in #SAS: string matching and fuzzy matching Click To Tweet

The "contains" operator (?) and the "not contains" operator (^?) match a substring that appears anywhere in the target character variable.

The "begins with" operator (=:) matches substrings that appear at the beginning of a target variable. For example, the following statements select observations for which the state begins with the letter "B", "C", or "D". (There are no US states that begin with "B.") Notice that the "begin with" operator is also vectorized in SAS/IML:

use Sashelp.Zipcode where(Statecode =: {"B" "C" "D"});  /* =:  "begins with" */
read all var {X Y "City" "Statecode"};
u = unique(Statecode);
print u;

Fuzzy matching of English words

Perhaps the most unusual operator in the WHERE clause in SAS is the "sounds like" operator (=*), which does "fuzzy matching" of English words. The operator finds English words that are similar to the specified target words by using the SOUNDEX function in SAS. The SOUNDEX function is often used to select different names that sound alike but have different spelling, such as "John" and "Jon" or "Lynn" and "Lynne." In the following WHERE clause, the "sounds like" operator is used to select observations for which the city sounds similar to either "Cary" or "Asheville." The selected observations are plotted in a scatter plot after eliminating duplicate rows.

use Sashelp.Zipcode where(City =* {"Cary" "Asheville"}); /* =*  "sounds like" */
read all var {X Y "City" "Statecode"};
start UniqueRows(x);            
   cols = 1:ncol(x);               /* sort by all columns */
   call sortndx(ndx, x, cols);     /* ndx = permutation of rows that sorts matrix */
   uRows = uniqueby(x, cols, ndx); /* locate unique rows of sorted matrix */
   return ( ndx[uRows] );          /* rows in original matrix */
r = UniqueRows(City||Statecode);   /* get row numbers in x for unique rows */
call scatter(X[r], Y[r]) group=Statecode[r] datalabel=City[r]
   option="markerattrs=(symbol=CircleFilled)" procopt="noautolegend";
Citites that sound like Cary or Asheville, filtered by a WHERE clause in SAS/IML

According to the "sounds like" operator, the name "Cary" sounds like "Carey," "Cory," and "Cherry." The name "Asheville" sounds like "Ashville," "Ashfield," and "Ash Flat."

In summary, the WHERE clause in SAS/IML works a little differently than the more-familiar version in the SAS DATA step. Both versions enable you to selectively include or exclude observations that satisfy one or more conditions. However, the SAS/IML WHERE clause is vectorized. You can specify a vector of conditions for operators, thus reproducing the functionality of the IN operator.

This article also demonstrates a few lesser-known string operators, such as "contains" (?), "not contains" (^?), "begins with" (=:), and "sounds like" (=*).

Post a Comment

Visualize a weighted regression

What is weighted regression? How does it differ from ordinary (unweighted) regression? This article describes how to compute and score weighted regression models.

Visualize a weighted regression

Technically, an "unweighted" regression should be called an "equally weighted " regression since each ordinary least squares (OLS) regression weights each observation equally. Similarly, an "unweighted mean" is really an equally weighted mean.

Recall that weights are not the same as frequencies. When talking about weights, it is often convenient to assume that the weights sum to unity. This article uses standardized weights, although you can use specify any set of weights when you use a WEIGHT statement in a SAS procedure.

One way to look at a weighted regression is to assume that a weight is related to the variance of an observation. Namely, the i_th weight, wi, indicates that the variance of the i_th value of the dependent variable is σ2 / wi, where σ2 is a common variance. Notice that an observation that has a small weight (near zero) has a relatively large variance. Intuitively, the observed response is not known with much precision; a weighted analysis makes that observation less influential.

Weighted regression assumes that some response values are more precise than others. #StatWisdom Click To Tweet

For example, the following SAS data set defines (x,y) values and weights (w) for 11 observations. Observations whose X value is close to x=10 have relatively large weights. Observations far from x=10 have small weights. The last observation (x=14) is assigned a weight of zero, which means that it will be completely excluded from the analysis.

data RegData;
input y x w;
2.3  7.4  0.058 
3.0  7.6  0.073 
2.9  8.2  0.114 
4.8  9.0  0.144 
1.3 10.4  0.151 
3.6 11.7  0.119 
2.3 11.7  0.119 
4.6 11.8  0.114 
3.0 12.4  0.073 
5.4 12.9  0.035 
6.4 14.0  0 

You can use PROC SGPLOT to visualize the observations and weights. In fact, PROC SGPLOT supports a REG statement that adds a regression line to the plot. The REG statement supports adding both weighted and unweighted regression lines:

proc sgplot data=RegData;
scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                colorresponse=w colormodel=TwoColorRamp;     /* color markers by weight */
reg x=x y=y / nomarkers legendlabel="Unweighted Regression"; /* usual OLS regression */
reg x=x y=y / weight=w nomarkers legendlabel="Weighted Regression"; /* weighted regression */
xaxis grid; yaxis grid;
Weighted regression curves in PROC SGPLOT

Each marker is colored by its weight. Dark blue markers (observations for which 8 < x < 12) have relatively large weights. Light blue markers have small weights and do not affect the weighted regression model very much.

You can see the effect of the weights by comparing the weighted and unweighted regression lines. The unweighted regression line (in blue) is pulled upward by the observations near x=13 and x=14. These observations have small weights, so the weighted regression line (in red) is not pulled upwards.

Weighted regression by using PROC REG

If you want to compute parameter estimates and other statistics, you can use the REG procedure in SAS/STAT to run the weighted and unweighted regression models. The following statement run PROC REG twice: once without a WEIGHT statement and once with a WEIGHT statement:

proc reg data=RegData plots(only)=FitPlot;
   Unweighted: model y = x;
   ods select NObs ParameterEstimates  FitPlot;
proc reg data=RegData plots(only)=FitPlot;
   Weighted: model y = x;
   weight w;
   ods select NObs ParameterEstimates  FitPlot;

The output (not shown) indicates that the unweighted regression model is Y = -0.23 + 0.36*X. In contrast, the weighted regression model is Y = 2.3 + 0.085*X. This confirms that the slope of the weighted regression line is smaller than the slope of the unweighted line.

A weighted regression module in SAS/IML

You can read the SAS documentation to find the formulas that are used for a weighted OLS regression model. The formula for the parameter estimates is a weighted version of the normal equations: b = (X`WX)-1(X`WY), where Y is the vector of observed responses, X is the design matrix, and W is the diagonal matrix of weights.

In SAS/IML it is more efficient to use elementwise multiplication than to multiply with a diagonal matrix. If you work through the matrix equations, you will discover that weighted regression is easily accomplished by using the normal equations on the matrices that result from multiplying the rows of Y and X by the square root of the weights. This is implemented in the following SAS/IML module:

proc  iml;
/* weighted polynomial regression for one regressor. Y, X, and w are col vectors */
start PolyRegEst(Y, X, w, deg);
   Yw = sqrt(w)#Y;                  /* mult rows of Y by sqrt(w) */
   XDesign = j(nrow(X), deg+1);
   do j = 0 to deg;                 /* design matrix for polynomial regression */
      Xdesign[,j+1] = X##j;
   Xw = sqrt(w)#Xdesign;            /* mult rows of X by sqrt(w) */
   b = solve(Xw`*Xw, Xw`*Yw);       /* solve normal equation */
   return b;
use RegData; read all var {Y X w}; close;    /* read data and weights */
/* loop to perform one-variable polynomial regression for deg=0, 1, and 2 */
do deg = 0 to 2;                    
   b = PolyRegEst(Y, X, w, deg);
   d = char(deg,1);                          /* '0', '1', or '2' */
   labl = "Estimates (deg=" + d + ")";
   print b[L=labl rowname=("b0":("b"+d))];   /* print estimates for each model */
Parameter estimates for weighted regression models

The output shows the parameter estimates for three regression models: a "mean model" (degree 0), a linear model (degree 1), and a quadratic model (degree 2). Notice that the parameter estimates for the weighted linear regression are the same as estimates computed by PROC REG in the previous section.

Score the weighted regression models

The previous section describes how to use SAS/IML to compute parameter estimates of weighted regression models, and you can also use SAS/IML to score the regression models. The scoring does not require knowing the weight variable, only the parameter estimates. The following module uses Horner's method to evaluate a polynomial on a grid of points:

/* Score regression fit on column vector x */
start PolyRegScore(x, coef);
   p = nrow(coef);
   y = j(nrow(x), 1, coef[p]);            /* initialize to coef[p] */
   do j = p-1 to 1 by -1; 
      y = y # x + coef[j];

You can compute predicted values for each model on a grid of points. You can then write the predicted values to a SAS data set and combine the predicted values and the original data. Lastly, you can use the SERIES statement in PROG SGPLOT to overlay the three regression models on the original data:

t = T( do(min(x), max(x), (max(x)-min(x))/25) ); /* uniform grid in range(X) */ 
Yhat = j(nrow(t), 3);
do d = 0 to 2;
   b = PolyRegEst(Y, X, w, d);                 /* weighted regression model of degree d */
   Yhat[,d+1] = PolyRegScore(t, b);            /* score model on grid */
Z = t || Yhat;                                 /* write three predicted curves to data set */
create RegFit from Z[c={"t" "Pred0" "Pred1" "Pred2"}];
append from Z;
data RegAll;                                   /* merge predicted curves with original data */
label w="Weight" Pred0="Weighted Mean" 
      Pred1="Weighted Linear Fit" Pred2="Weighted Quadratic Fit";
merge RegData RegFit;
title "Weighted Regression Models";            /* overlay weighted regression curves and data */
proc sgplot data=RegAll;
scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                  colorresponse=w colormodel=TwoColorRamp;
series x=t  y=Pred0 / curvelabel;
series x=t  y=Pred1 / curvelabel;
series x=t  y=Pred2 / curvelabel;
xaxis grid; yaxis grid;
Predicted values for weighted regression models

A visualization of the weighted regression models is shown to the left. The weighted linear fit is the same line that was shown in the earlier graph. The weighted mean and the weighted quadratic fit are the zero-degree and second-degree polynomial models, respectively. Of course, you could also create these curves in SAS by using PROC REG or by using the REG statement in PROC SGPLOT.


Weighted regression has many applications. The application featured in this article is to fit a model in which some response values are known with more precision than others. We saw that it is easy to create a weighted regression model in SAS by using PROC REG or PROC SGPLOT. It is only slightly harder to write a SAS/IML function to use matrix equations to "manually" compute the parameter estimates. No matter how you compute the model, observations that have relatively large weights are more influential in determining the parameter estimates than observations that have small weights.

Post a Comment

Let PROC FREQ create graphs of your two-way tables

The recent releases of SAS 9.4 have featured major enhancements to the ODS statistical graphics procedures such as PROC SGPLOT. In fact, PROC SGPLOT (and the underlying Graph Template Language (GTL)) are so versatile and powerful that you might forget to consider whether you can create a graph automatically by using a SAS statistical procedure. For example, when you turn on ODS graphics (ODS GRAPHICS ON), SAS procedures create the following graphs automatically:

  1. Many SAS regression procedures (and PROC PLM) create effect plots.
  2. PROC SURVEYREG creates a hexagonal bin plot.
  3. PROC REG creates heat maps when a scatter plot would suffer from overplotting.
  4. PROC LOGISTIC creates odds-ratio plots.

Let PROC FREQ visualize your two-way tables

Recently a SAS customer asked how to use PROC SGPLOT to produce a stacked bar chart. I showed him how to do it, but I also mentioned that the same graph could be produced with less effort by using PROC FREQ.

PROC FREQ is a workhorse procedure that can create dozens of graphs. For example, PROC FREQ can create a mosaic plot and a clustered bar chart to visualize frequencies and relative frequencies in a two-way table. Next time you use PROC FREQ, add the PLOTS=ALL option to the TABLES statement and see what you get!

One of my favorite plots for two-way categorical data is the stacked bar chart. As I told the SAS customer, it is simple to create a stacked bar chart in PROC FREQ. For example, the following statements create a horizontal bar chart that orders the categories by frequency:

proc freq data=sashelp.Heart order=freq;
   tables weight_status*smoking_status / 
       plots=freqplot(twoway=stacked orient=horizontal);
Stacked bar chart created by PROC FREQ

See the documentation for the PLOTS= option in the TABLES statement for a description of all the plots that PROC FREQ can create. PROC FREQ creates many plots that are associated with a particular analysis, such as the "deviation plot," which shows the relative deviations between the observed and expected counts when you request a chi-square analysis of a one-way table.

Do you need a highly customized graph?

The ODS graphics in SAS are designed to create many—but not all—of the visualizations that are relevant to an analysis. Some attributes of a graph (for example, the title and the legend placement) are determined by a stored template and can't be modified by using the procedure syntax. Advanced GTL gurus might want to learn how to edit the ODS templates. Less ambitious users might choose to use a statistical procedure to automatically create graphs during data exploration and modeling, but then use PROC SGPLOT to create the final graph for a report.

For example, the following call to PROC FREQ writes the two-way frequency counts to a SAS data set. From the data you can create graphs that are similar to the one that PROC FREQ creates, but you can change the order and colors of the bars, alter the placement of the legend, add text, and more. The following call to PROC SGPLOT shows one possibility. Click on the graph to see the full-size version.

proc freq data=sashelp.heart order=freq noprint;
   tables smoking_status*weight_status / out=FreqOut(where=(percent^=.));
ods graphics /height=500px width=800px;
title "Counts of Weight Categories by Smoking Status";
proc sgplot data=FreqOut;
  hbarparm category=smoking_status response=count / group=weight_status  
      seglabel seglabelfitpolicy=none seglabelattrs=(weight=bold);
  keylegend / opaque across=1 position=bottomright location=inside;
  xaxis grid;
  yaxis labelpos=top;
Customized stacked bar chart created by PROC SGPLOT, using the output from PROC FREQ


If you want to create a stacked bar chart or some other visualization of a two-way table, you might be tempted to immediately start using PROC SGPLOT. The purpose of this article is to remind you that SAS statistical procedures, including PROC FREQ, often create graphs as part of their output. Even if a statistical procedure cannot provide all the bells and whistles of PROC SGPLOT, it often is a convenient way to visualize your data during the preliminary stages of an analysis. If you need a highly customized graph for a final report, the SAS procedure can output the data for the graph. You can then use PROC SGPLOT or the GTL to create a customized graph.

Post a Comment