Create patterns of missing data

When simulating data or testing algorithms, it is useful to be able to generate patterns of missing data. This article shows how to generate random and systematic patterns of missing values. In other words, this article shows how to replace nonmissing data with missing data.


Generate a random pattern of missing values

The following SAS/IML program reads numerical data into a matrix from the Sashelp.Class data set. The matrix has 19 rows and three columns. The program then generates a matrix of the same size that contains a random pattern of zeros and ones, where about 40% of the values will be ones. The LOC function is used to find the locations of the ones, and the corresponding locations in the data are set to missing:

proc iml;
use Sashelp.Class;                    /* read numeric data into X */
read all var _NUM_ into X;
/* random assignment of missing values */
RandX = X;                                          /* copy data */
p = 0.4;                /* approx proportion of missing elements */
call randseed(1234);
B = randfun(dimension(X), "bern", p);         /* random 0s or 1s */
missIdx = loc(B=1);                       /* find position of 1s */
if ncol(missIdx)>0 then
   RandX[missIdx] = .;                /* replace 1s with missing */
print RandX;
A random pattern of missing values

In this way, you can replace a certain percentage of the data values with missing values.

Generate a systematic pattern of missing values

In the preceding section, the technique for inserting missing values does not use the fact that the matrix B is random. The technique works with any zero-one matrix B that specifies a pattern of missing values. For example, you can create a matrix that contains all combinations of zeros and ones, then use that pattern to set missing values, as follows:

C = { 0 0 0, 
      0 0 1, 
      0 1 0, 
      0 1 1, 
      1 0 0, 
      1 0 1, 
      1 1 0, 
      1 1 1 };                              /* pattern matrix */
missIdx = loc(C=1);     
SysX = X;                                        /* copy data */
if ncol(missIdx)>0 then
   SysX[missIdx] = .;              /* replace 1s with missing */
print SysX;
Systematic pattern of missing values

You could also specify the locations of the missing values by using subscripts of the data matrix. You can use the SUB2NDX function to convert subscripts to indices.
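The subscript-to-index arithmetic is simple enough to sketch outside of SAS. The following Python function (the name mirrors SUB2NDX but is my own, not a SAS API) converts 1-based (row, column) subscripts into 1-based linear indices for a matrix stored in row-major order, which is how SAS/IML stores data:

```python
def sub2ndx(dims, subscripts):
    """Convert 1-based (row, col) subscripts to 1-based linear
    indices for a matrix stored in row-major order, as SAS/IML does."""
    nrow, ncol = dims
    ndx = []
    for i, j in subscripts:
        assert 1 <= i <= nrow and 1 <= j <= ncol
        ndx.append((i - 1) * ncol + j)   # row-major linear index
    return ndx

# indices of the elements (1,1), (2,3), and (19,2) in a 19 x 3 matrix
print(sub2ndx((19, 3), [(1, 1), (2, 3), (19, 2)]))  # [1, 6, 56]
```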

Patterns of missing data by using the SAS DATA step

In the SAS DATA step you can use arrays to create a random pattern of missing values. For example, the following DATA step reads the numeric variables from the Sashelp.Class data and randomly assigns 40% of the data values to missing:

/* generate missing values in random locations */
data RandClass(drop=i);
call streaminit(1234);
set Sashelp.Class(keep=_NUMERIC_);
array x {*} _numeric_;
do i = 1 to dim(x);
   if rand("Bern", 0.4) then      /* p=0.4 ==> about 40% missing */
      x[i] = .;                   /* replace value with missing */
end;
run;

proc print; run;

The output is not shown, but the random pattern is identical to the random pattern that was generated by using SAS/IML matrices.

You could use the DATA step to specify patterns of missing values for which there is a formula, such as every fourth data value (MOD(cnt,4)=1). However, it is harder to generate an arbitrary pattern, such as the "all combinations" pattern in the previous section. In general, I think the SAS/IML approach is easier to use and more flexible.
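For a formula-based pattern, the logic is only a line or two in any language. As a language-neutral illustration, this Python sketch marks every fourth data value as missing, mimicking the MOD(cnt,4)=1 rule (None stands in for a SAS missing value):

```python
# Mark every fourth data value as missing (the MOD(cnt,4)=1 pattern).
# None plays the role of a SAS missing value in this sketch.
data = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6]
masked = [None if cnt % 4 == 1 else v
          for cnt, v in enumerate(data, start=1)]
print(masked)  # [None, 3.0, 2.9, 4.8, None, 3.6, 2.3, 4.6]
```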

For any pattern of missing values, you can use PROC MI to summarize the pattern. You can also use various graphical techniques to visualize the pattern of missing data.


Ahh, that's smooth! Anti-aliasing in SAS statistical graphics

I've written several articles about scatter plot smoothers: nonparametric regression curves that reveal small- and large-scale features of a response variable as a function of an explanatory variable. However, there is another kind of "smoothness" that you might care about, and that is the apparent smoothness of curves and markers that are rendered on your computer screen or other device.

Recall that anti-aliasing is a graphical technique in which some pixels along a rendered curve are set to an intermediate color, which makes a curve look smoother. For example, if a curve is being drawn by using a black pen, some of the neighboring pixels along the rendered curve are set to shades of grey, which tricks the eye into seeing a smooth curve instead of a jagged, pixellated curve.
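If you want to experiment with the idea, the blending rule is easy to sketch. The following Python snippet (a simplification of what a real renderer does) computes the gray level of a pixel from the fraction of the pixel that an ideal black curve covers:

```python
def blend(coverage, pen=0, background=255):
    """Gray level for a pixel that a black pen partially covers.
    coverage is the fraction of the pixel the ideal curve covers."""
    return round(pen * coverage + background * (1 - coverage))

# fully covered pixels are black, untouched pixels stay white,
# and partially covered pixels get intermediate shades of gray
print([blend(c) for c in (1.0, 0.5, 0.25, 0.0)])  # [0, 128, 191, 255]
```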


Anti-aliasing in ODS statistical graphics

The ANTIALIAS= and ANTIALIASMAX= options were added to the ODS GRAPHICS statement in SAS 9.2. A typical usage follows:

ods graphics / antialias=on antialiasmax=5000;

The ANTIALIAS= option specifies whether to anti-alias. By default, anti-aliasing is on. Because it can be expensive to anti-alias many thousands of graphical elements, the ANTIALIASMAX= option enables you to specify the maximum number of elements (markers or curve points) that can be in a plot before anti-aliasing is disabled for that plot. The default value is ANTIALIASMAX=4000 for SAS 9.4m3. However, the default is only 600 for earlier releases, so you might want to bump up that value when you need a presentation-quality graphic that has thousands of graphical elements. If SAS disables anti-aliasing for a plot because the plot contains too many elements, the SAS log will contain a note similar to the following:

NOTE: Marker and line anti-aliasing has been disabled because
 the threshold has been reached. You can set
 ANTIALIASMAX=1000 in the ODS GRAPHICS statement to restore
 anti-aliasing.
Subpixel rendering in ODS graphics

A related option is turning on subpixel rendering by using the SUBPIXEL option. The SUBPIXEL option was added to the ODS GRAPHICS statement in SAS 9.4m3, but it has been available on the PROC SGPLOT statement for several 9.4 releases.

The SAS documentation for the ODS GRAPHICS statement says that the SUBPIXEL option "produces smoother curves and more precise bar spacing." There is a section in the documentation titled "Subpixel Rendering," which demonstrates the impact that subpixel rendering can have on curves and bar charts.

The documentation says that subpixel rendering "is enabled by default for image output, unless the graph contains a scatter plot or a scatter-plot matrix. In those cases, subpixel rendering is disabled by default."

For me, subpixel rendering solves a problem that I've experienced when I create a large bar chart with many categories. The number of bars, the width of the bars, and the dimensions of the graph determine whether the number of pixels between bars is uniform or whether some gaps are larger than others. Sometimes you will see small uneven gaps between the bars, as shown on the left side of the following plot. However, subpixel rendering improves the plot tremendously, as shown on the right side:


In SAS 9.4m3 and beyond, the SUBPIXEL option applies to all plot types. Prior to SAS 9.4m3, the option applied only to line charts and bar charts; see the documentation of the PROC SGPLOT statement for the specific plots that were supported.

Try it yourself

I think the best way to learn about anti-aliasing and subpixel rendering is to try it out yourself! These ODS options apply to all ODS statistical graphics, including those that are created by SAS analytical procedures. Remember, however, that the option was only available in the PROC SGPLOT statement for 9.4 releases prior to m3.

The following SAS statements enable you to play with the options and see the differences for a simple loess curve overlaid on a scatter plot:

ods graphics / reset ANTIALIAS=off;       /* anti-aliasing off */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=off; /* anti-aliasing on */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=on;  /* SAS 9.4m3 */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

Further reading

The following resources provide further information about anti-aliasing and subpixel rendering in ODS graphics:


Loess regression in SAS/IML

A previous post discusses how the loess regression algorithm is implemented in SAS. The LOESS procedure in SAS/STAT software provides the data analyst with options to control the loess algorithm and fit nonparametric smoothing curves through points in a scatter plot.

Although PROC LOESS satisfies 99.99% of SAS users who want to fit a loess model, some research statisticians might want to extend or modify the standard loess algorithm. Researchers like to ask "what if" questions like "what if I used a different weighting function?" or "what if I change the points at which the loess model is evaluated?" Although the loess algorithm is complicated, it is not hard to implement a basic version in a matrix language like SAS/IML.


Implement loess regression in SAS/IML

Recent blog posts have provided some computational modules that you can use to implement loess regression. For example, the PairwiseNearestNbr module finds the k nearest neighbors to a set of evaluation points. The functions for weighted polynomial regression compute the loess fit at a particular point.

You can download a SAS/IML program that defines the nearest-neighbor and weighted-regression modules. The following call to PROC IML loads the modules and defines a function that fits a loess curve at points in a vector, t. Each fit uses a local neighborhood that contains k data values. The local weighted regression is a degree-d polynomial:

proc iml;
load  module=(PairwiseNearestNbr PolyRegEst PolyRegScore);
/* 1-D loess algorithm. Does not handle missing values.
   Input: t: points at which to fit loess (column vector)
          x, y: nonmissing data (column vectors)
          k: number of nearest neighbors used for loess fit
          d: degree of local regression model
   Output: column vector L[i] = f(t[i]; k, d) where f is loess model */
start LoessFit(t, x, y, k, d=1);
   m = nrow(t);
   Fit = j(m, 1);                            /* Fit[i] is predicted value at t[i] */
   do i = 1 to m;
      x0 = t[i];
      run PairwiseNearestNbr(idx, dist, x0, x, k);
      XNbrs = X[idx];  YNbrs = Y[idx];       /* X and Y values of k nearest nbrs */
      /* local weight function where dist[k] is max dist in neighborhood */
      w = 32/5 * (1 - (dist / dist[k])##3 )##3; /* use tricubic weight function */
      b = PolyRegEst(YNbrs, XNbrs, w`, d);   /* param est for local weighted regression */
      Fit[i] = PolyRegScore(x0, b);          /* evaluate polynomial at x0 */
   end;
   return Fit;
finish;

This algorithm provides some features that are not in PROC LOESS. You can use this function to evaluate a loess fit at arbitrary X values, whereas PROC LOESS evaluates the function only at quantiles of the data. You can use this function to fit a local polynomial regression of any degree (for example, a zero-degree polynomial), whereas PROC LOESS fits only first- and second-degree polynomials. Although I hard-coded the standard tricubic weight function, you could replace the function with any other weight function.

On the other hand, PROC LOESS supports many features that are not in this proof-of-concept function, such as automatic selection of the smoothing parameter, handling missing values, and support for higher-dimensional loess fits.

Call the loess function on example data

Let's use polynomials of degree 0, 1, and 2 to compute three different loess fits. The LoessData data set is defined in my previous article:

use LoessData;  read all var {x y};  close;   /* read example data */
s = 0.383;                /* specify smoothing parameter */
k = floor(nrow(x)*s);     /* num points in local neighborhood */
/* grid of points to evaluate loess curve (column vector) */
t    = T( do(min(x), max(x), (max(x)-min(x))/50) );
Fit0 = LoessFit(t, x, y, k, 0);    /* loess fit with degree=0 polynomials */
Fit1 = LoessFit(t, x, y, k, 1);    /* degree=1 */
Fit2 = LoessFit(t, x, y, k, 2);    /* degree=2 */
create Sim var {x y t Fit0 Fit1 Fit2};  append;  close;

You can use PROC SGPLOT to overlay these loess curves on a scatter plot of the data:

title "Overlay loess curves computed in SAS/IML";
proc sgplot data=Sim;
label Fit0="Loess (Deg=0)" Fit1="Loess (Deg=1)" Fit2="Loess (Deg=2)";
scatter x=x y=y;
series x=t  y=Fit0 / curvelabel;
series x=t  y=Fit1 / curvelabel lineattrs=(color=red);
series x=t  y=Fit2 / curvelabel lineattrs=(color=ForestGreen);
xaxis grid; yaxis grid;
run;
Loess curves computed in SAS/IML

The three curves are fairly close to each other on the interior of the data. The degree 2 curve wiggles more than the other two curves because it uses a higher-degree polynomial. The over- and undershooting becomes even more pronounced if you use cubic or quartic polynomials for the local weighted regressions.

The curious reader might wonder how these curves compare to curves that are created by PROC LOESS or by the LOESS statement in PROC SGPLOT. In the attached program I show that the IML implementation produces the same predicted values as PROC LOESS when you evaluate the models at the same set of points.

Most SAS data analysts are happy to use PROC LOESS. They don't need to write their own loess algorithm in PROC IML. However, this article shows that IML provides the computational tools and matrix computations that you need to implement sophisticated algorithms, should the need ever arise.


What is loess regression?

Scatter Plot with Loess Smoother

Loess regression is a nonparametric technique that uses local weighted regression to fit a smooth curve through points in a scatter plot. Loess curves can reveal trends and cycles in data that might be difficult to model with a parametric curve. Loess regression is one of several algorithms in SAS that can automatically choose a smoothing parameter that best fits the data.

In SAS, there are two ways to generate a loess curve. When you want to see statistical details for the fit, use the LOESS procedure. If you just want to overlay a smooth curve on a scatter plot, you can use the LOESS statement in PROC SGPLOT.

This article discusses the 1-D loess algorithm and shows how to control features of the loess regression by using PROC LOESS and PROC SGPLOT. You can also use PROC LOESS to fit higher dimensional data; the PROC LOESS documentation shows an example of 2-D loess, which fits a response surface as a function of two explanatory variables.

Overview of the loess regression algorithm

The loess algorithm, which was developed by Bill Cleveland and his colleagues in the late 1970s through the early 1990s, has had several different incarnations. Assume that you are fitting the loess model at a point x0, which is not necessarily one of the data values. The following list describes the main steps in the loess algorithm as implemented in SAS:

  1. Choose a smoothing parameter: The smoothing parameter, s, is a value in (0,1] that represents the proportion of observations to use for local regression. If there are n observations, then the k = floor(n*s) points closest to x0 (in the X direction) form a local neighborhood near x0.
  2. Find the k nearest neighbors to x0: I recently showed a SAS/IML module that can compute nearest neighbors.
  3. Assign weights to the nearest neighbors: The loess algorithm uses a tricubic weight function to weight each point in the local neighborhood of x0. The weight for the i_th point in the neighborhood is
    wi = (32/5) (1 - (di / D)^3)^3
    where D is the largest distance in the neighborhood and di is the distance to the i_th point. (The weight function is zero outside of the local neighborhood.) The graph of the tricubic weight function is shown below.
    The weight function gives more weight to observations whose X value is close to x0 and less weight to observations that are farther away.
  4. Perform local weighted regression: The points in the local neighborhood of x0 are used to fit and score a local weighted regression model at x0.

These four steps implement the basic loess method. The SAS procedures add a fifth step: optimize the smoothing parameter by fitting multiple loess models. You can use a criterion such as the AICC or GCV to balance the tradeoff between a tight fit and a complex model. For details, see the documentation for selecting the smoothing parameter.
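To make steps 1-4 concrete, here is a minimal Python sketch of a degree-1 loess fit. It is not the SAS implementation: the function name is mine, there is no smoothing-parameter selection, and edge cases (ties, constant neighborhoods) are ignored:

```python
def loess_fit(t, x, y, s):
    """Evaluate a degree-1 loess fit at each point of t.
    s is the smoothing parameter; weights use the tricubic function."""
    n = len(x)
    k = max(2, int(n * s))                       # step 1: neighborhood size
    fits = []
    for x0 in t:
        # step 2: the k nearest neighbors of x0 in the X direction
        nbrs = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        D = max(abs(x[i] - x0) for i in nbrs)    # largest distance
        # step 3: tricubic weights
        w = [(32/5) * (1 - (abs(x[i] - x0)/D)**3)**3 for i in nbrs]
        # step 4: weighted linear regression, closed form
        sw = sum(w)
        xbar = sum(wi * x[i] for wi, i in zip(w, nbrs)) / sw
        ybar = sum(wi * y[i] for wi, i in zip(w, nbrs)) / sw
        sxx = sum(wi * (x[i] - xbar)**2 for wi, i in zip(w, nbrs))
        sxy = sum(wi * (x[i] - xbar) * (y[i] - ybar) for wi, i in zip(w, nbrs))
        b1 = sxy / sxx
        fits.append(ybar + b1 * (x0 - xbar))     # score the local model at x0
    return fits
```

Because each local fit of collinear data is exact, the sketch reproduces a straight line exactly, which is a handy sanity check.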

How to score a loess regression model

The previous section described how to fit a loess model at a particular point x0. PROC LOESS provides two choices for the locations at which you can evaluate the model:

  • By default, PROC LOESS evaluates the model at a data-dependent set of points, V, which are vertices of a k-d tree. Think of the points of V as a grid of X values. However, the grid is not linearly spaced in X, but is approximately linear in the quantiles of the data.
  • You can evaluate the model at each unique X data value by using the DIRECT option on the MODEL statement.

If you want to score the model on a set of new observations, you cannot use the direct method. When you score new observations by using the SCORE statement, PROC LOESS uses linear or cubic interpolation between the points of V and the new observations. You can specify the interpolation scheme by using the INTERP= option on the MODEL statement.
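Linear interpolation between fit points is straightforward to sketch. The following Python function (an illustration, not the PROC LOESS code) scores a new observation from a sorted grid of fitted values:

```python
def interp_linear(v, fit, x0):
    """Linearly interpolate the predicted value at x0 from fitted
    values (v[i], fit[i]); v must be sorted in increasing order."""
    if x0 <= v[0]:
        return fit[0]
    if x0 >= v[-1]:
        return fit[-1]
    for i in range(len(v) - 1):
        if v[i] <= x0 <= v[i + 1]:
            t = (x0 - v[i]) / (v[i + 1] - v[i])
            return (1 - t) * fit[i] + t * fit[i + 1]

# predicted values at v = [0, 2, 5]; score a new observation at x0 = 3
print(interp_linear([0, 2, 5], [1.0, 3.0, 0.0], 3))  # 2.0
```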

Comparing PROC LOESS and the LOESS statement in PROC SGPLOT

The MODEL statement for the LOESS procedure provides many options for controlling the loess regression model. The LOESS statement in PROC SGPLOT provides only a few frequently used options. In some instances, PROC SGPLOT uses different default values, so it is worthwhile to compare the two statements.

  • Choose a smoothing parameter: In both procedures, you can choose a smoothing parameter by using the SMOOTH= option.
  • Fit the local weighted regression: In both procedures, you can control the degree of the local weighted polynomial regression by using the DEGREE= option. You can choose a linear or a quadratic regression model. Both procedures use the tricubic function to determine weights in the local neighborhood.
  • Choose an optimal smoothing parameter: PROC LOESS provides the SELECT= option for controlling the selection of the optimal smoothing parameter. PROC SGPLOT does not provide a choice: it always optimizes the AICC criterion with the PRESEARCH suboption.
  • Evaluate the fit: Both procedures evaluate the fit at a set of data-dependent values, then use interpolation to evaluate the fit at other locations.
    • In PROC LOESS, you can use the SCORE statement to interpolate at an arbitrary set of points. You use the INTERP= option in the MODEL statement to specify whether to use linear or cubic interpolation.
    • In PROC SGPLOT, the interpolation is performed on a uniform grid of points. The default grid contains 201 points between min(x) and max(x), but you can use the MAXPOINTS= option to change that number. You use the INTERPOLATION= option to specify linear or cubic interpolation.

A loess example in SAS

The following SAS DATA step creates 30 observations for X and Y variables. The call to PROC LOESS fits a loess curve to the data and creates a fit plot, a residual plot, and a panel of diagnostic plots. Only the fit plot is shown:

data LoessData;
input x y @@;
datalines;
11.7  2.3  19.9  8.1  11.8  4.6  17.1  5.1  16.5  4.8
 5.6  1.7  12.9  5.4   7.6  3.0   9.0  4.8  17.5  5.0
10.4  1.3  16.9  2.8   5.6  1.8  18.7  6.9   3.7  1.7
 7.4  2.3   2.0  2.7  14.8  5.2   3.0  0.0  16.8  4.2
15.0  6.6  19.9  5.5   1.9  1.1  14.8  5.8  12.4  3.0
14.0  6.4  11.7  3.6   8.2  2.9  18.8  6.2   0.3  1.8
;

ods graphics on;
ods select FitPlot;
proc loess data=LoessData plots=FitPlot;
model y = x / interp=linear           /* LINEAR or CUBIC */
              degree=1                /* 1 or 2 */
              select=AICC(presearch); /* or SMOOTH=0.383 */
run;

For the PROC LOESS call, all options are the default values except for the PRESEARCH suboption in the SELECT= option. You can create the same fit plot by using the LOESS statement in PROC SGPLOT. The default interpolation scheme in PROC SGPLOT is cubic, so the following statements override that default option:

title "PROC SGPLOT with LOESS Statement";
proc sgplot data=LoessData  noautolegend;
loess x=x y=y / interpolation=linear  /* CUBIC or LINEAR */
                degree=1              /* 1 or 2 */
                ;  /* default selection or specify SMOOTH=0.383 */
xaxis grid; yaxis grid;
run;
Comparison of loess regression curves in SAS: PROC LOESS versus PROC SGPLOT

The two plots are shown side by side. The one on the left was created by PROC LOESS. The one on the right was created by PROC SGPLOT.

In conclusion, SAS provides two ways to overlay a smooth loess curve on a scatter plot. You can use PROC LOESS when you want to see the details of statistical aspects of the fit and the process that optimizes the smoothing parameter. You can use the SGPLOT procedure when you care less about the details, but simply want an easy way to show a nonlinear relationship between a response and an explanatory variable.


The empty-space distance plot

How far away is the nearest hospital? How far is the nearest restaurant? The nearest gas station? These are commonly asked questions whose answers depend on the location of the person asking the question.

Recently I showed an algorithm that enables you to find the distance between a set of locations and a fixed set of reference sites. The set of locations (the people) can be arbitrary, so a good way to summarize the problem is to draw a contour plot that shows the distance from every point in a region to the nearest reference site (a hospital or store). In SAS, you can create this graph by using the SPP procedure, which analyzes spatial point processes. The graph is called an empty-space plot.

Example: The location of large US cities

Some people love big cities and the diverse opportunities that they provide. They want to know how far it is to the nearest big city. Other people hate the noise and congestion; they want to be as far away from a big city as possible. To demonstrate how to create an empty-space plot, let's look at the spatial distribution of US cities that have population more than 200,000.

The following SAS DATA step reads data from the Maps.USCity data set, which is distributed as part of SAS/GRAPH software. The BigCities data contains only cities in the contiguous US for which the population is greater than 200,000. The call to PROC SGPLOT creates a scatter plot that shows the projected longitude and latitude (measured in radians) of these 86 cities:

data BigCities;
set Maps.USCity;                          /* part of SAS/GRAPH */
where statecode not in ("AK" "HI" "PR") & pop > 200000;
if pop>500000 then Name=City;             /* label really big cities */
else Name = "        ";
run;

title  "Large US Cities";
title2 "Population > 200,000";
proc sgplot data=BigCities;
  scatter x=x y=y / markerattrs=(symbol=CircleFilled) datalabel=Name;
run;
Locations of US cities with population greater than 200,000

The scatter plot supports a well-known fact: cities are not settled in random locations. People tend to settle near beneficial geographic features, such as proximity to rivers, harbors, and other strategic resources. Therefore the cities are not spread uniformly throughout the country. There are clusters of cities on both coasts, and there is a large empty space in the rugged northwest. The large empty space contains locations that are much farther away from a big city than would be expected for a random uniform assortment of 86 cities.

The following call to PROC SPP creates an empty-space plot, which reveals regions that are far from a big city. The F option will be explained in a moment:

proc spp data=BigCities plots(equate)=(emptyspace(obs=on) F);
   process BigCities = (x, y / area=(-0.37,-0.19,0.32, 0.24)) / F;
run;
Empty-space plot for large US cities, created in SAS

The dark blue color indicates geographic coordinates that are far from any large city. The darkest blue appears to be in the northwestern states, such as Montana, Idaho, and Wyoming. Western Texas also stands out. Those are great places to live for people who want to be far from large cities. There is also dark blue in regions that are uninhabitable (oceans and the Great Lakes) or are not part of the US (Mexico and Canada).

I think the image is beautiful, but it is useful as well. It gives a visual summary of locations that are far from the reference sites (big cities).

Empty-space distances

F function for average spatial distances, computed with PROC SPP in SAS

If you want a more statistical analysis, the F option in the PROCESS statement and in the PLOTS= option tells PROC SPP to lay down a regular grid of points and compute the distance from each grid position to the nearest big city. (By default the grid is 50 x 50.) The graph shows the cumulative distribution of these distances in blue. You can compare the distribution of these distances to the expected distribution for randomly selected locations, which is shown in red and labeled "complete spatial randomness."
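The computation behind the empty-space distances is easy to prototype. The following Python sketch (a brute-force illustration, not the PROC SPP algorithm) computes the distance from each grid point to the nearest site and evaluates the empirical distribution of those distances:

```python
from math import hypot

def empty_space_distances(grid, sites):
    """Distance from each grid point to the nearest site."""
    return [min(hypot(gx - sx, gy - sy) for sx, sy in sites)
            for gx, gy in grid]

def edf(dists, r):
    """Empirical distribution: fraction of distances <= r."""
    return sum(d <= r for d in dists) / len(dists)

sites = [(0, 0), (4, 0)]                   # two reference sites
grid = [(x, 0) for x in range(5)]          # a tiny 1-D "grid"
d = empty_space_distances(grid, sites)
print(d)                                   # [0.0, 1.0, 2.0, 1.0, 0.0]
print(edf(d, 1.0))                         # 0.8
```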

You can see that the empirical distribution rises more slowly than the theoretical curve. This indicates that the observed locations are more clustered than you would expect if the cities were randomly located. For details about the F option, see the documentation for PROC SPP.

The main purpose of this example is to demonstrate that PROC SPP makes it easy to compute the empty-space plot, which summarizes the distance between an arbitrary location and a pattern of points. In this article the points were large US cities, but obviously you could analyze other point patterns. The SPP procedure has many options that indicate whether the pattern appears to be randomly distributed or clustered. The procedure indicates that the locations of large US cities are not random.


WHERE operators in SAS: Multiple comparisons and fuzzy matching

The WHERE clause in SAS is a powerful mechanism for selecting observations as you read or write a data set. The WHERE clause supports many operators, including the IN operator, which enables you to compactly specify multiple conditions for a categorical variable.

A common use of the IN operator is to specify a list of US states and territories that should be included or excluded in an analysis. For example, the following DATA step reads the Sashelp.Zipcode data, but excludes zip codes for the states of Alaska ("AK"), Hawaii ("HI"), and US territories such as Puerto Rico ("PR"), Guam ("GU"), and so on:

data Lower48;
set Sashelp.Zipcode(where=(   /* exclude multiple states and territories */
    Statecode not in ("AK" "HI" "PR" "VI" "GU" "FM" "MP" "MH" "PW")));
run;

WHERE operators in SAS/IML are vectorized

In my previous article about how to use the WHERE clause in SAS/IML, my examples used scalar comparisons such as where(sex="F") to select only females in the data. The SAS/IML language does not support the IN operator, but there is another compact way to include or exclude multiple values. Because SAS/IML is a matrix-vector language, many operations support vector arguments. In particular, the WHERE clause in the SAS/IML language enables you to use the ordinary equal operator (=) and specify a vector of values on the right hand side!

For example, the following statement reads in all US zip codes in the contiguous US and creates a scatter plot of their locations:

proc iml;
excludeList = {"AK" "HI" "PR" "VI" "GU" "FM" "MP" "MH" "PW"};
use Sashelp.Zipcode where(Statecode ^= excludeList);   /* vector equiv of NOT IN */
read all var {X Y "City" "Statecode"};
title "Centers of US ZIP Codes";
call scatter(X, Y) group=Statecode option="markerattrs=(size=2)" 
     label={"Longitude" "Latitude"} procopt="noautolegend";
ZIP code locations filtered by a WHERE clause in SAS/IML

The WHERE clause skips observations for which the Statecode variable matches any of the values in the excludeList vector. The scatter plot reveals the basic shape of the contiguous US. You can see that the plot does not display locations from Alaska, Hawaii, or US territories.

String matching operators in the SAS WHERE clause

As long as we're talking about the WHERE clause in SAS, let's discuss some string-matching operators that might not be familiar to some SAS programmers. I'll use SAS/IML for the examples, but these operators are generally supported (in scalar form) in all SAS WHERE clauses. The operators are

  • The "contains" operator (?)
  • The "not contains" operator (^?)
  • The "begins with" operator (=:)
  • The "sounds like" operator (=*)

All these operators are documented in the list of WHERE clause operators in SAS/IML.


The "contains" operator (?) and the "not contains" operator (^?) match a substring that appears anywhere in the target character variable.

The "begins with" operator (=:) matches substrings that appear at the beginning of a target variable. For example, the following statements select observations for which the state begins with the letter "B", "C", or "D". (There are no US states that begin with "B.") Notice that the "begin with" operator is also vectorized in SAS/IML:

use Sashelp.Zipcode where(Statecode =: {"B" "C" "D"});  /* =:  "begins with" */
read all var {X Y "City" "Statecode"};
u = unique(Statecode);
print u;

Fuzzy matching of English words

Perhaps the most unusual operator in the WHERE clause in SAS is the "sounds like" operator (=*), which does "fuzzy matching" of English words. The operator finds English words that are similar to the specified target words by using the SOUNDEX function in SAS. The SOUNDEX function is often used to select different names that sound alike but have different spelling, such as "John" and "Jon" or "Lynn" and "Lynne." In the following WHERE clause, the "sounds like" operator is used to select observations for which the city sounds similar to either "Cary" or "Asheville." The selected observations are plotted in a scatter plot after eliminating duplicate rows.

use Sashelp.Zipcode where(City =* {"Cary" "Asheville"}); /* =*  "sounds like" */
read all var {X Y "City" "Statecode"};
start UniqueRows(x);
   cols = 1:ncol(x);               /* sort by all columns */
   call sortndx(ndx, x, cols);     /* ndx = permutation of rows that sorts matrix */
   uRows = uniqueby(x, cols, ndx); /* locate unique rows of sorted matrix */
   return ( ndx[uRows] );          /* rows in original matrix */
finish;

r = UniqueRows(City||Statecode);   /* get row numbers in x for unique rows */
call scatter(X[r], Y[r]) group=Statecode[r] datalabel=City[r]
   option="markerattrs=(symbol=CircleFilled)" procopt="noautolegend";
Cities that sound like Cary or Asheville, filtered by a WHERE clause in SAS/IML

According to the "sounds like" operator, the name "Cary" sounds like "Carey," "Cory," and "Cherry." The name "Asheville" sounds like "Ashville," "Ashfield," and "Ash Flat."
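If you are curious how the =* operator decides that two names "sound alike," here is a simplified Python version of the Soundex encoding. (The SOUNDEX function in SAS handles a few more edge cases; this sketch captures the main idea.)

```python
def soundex(name):
    """A simplified Soundex encoding: first letter plus up to three
    digit codes; vowels separate runs, H and W are ignored."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if ch in "hw":
            continue                 # H and W do not break a run
        if code and code != prev:
            out += code              # record a new consonant code
        prev = code                  # vowels reset the previous code
    return (out + "000")[:4]         # pad or truncate to 4 characters

print(soundex("Cary"), soundex("Carey"))   # C600 C600
print(soundex("John"), soundex("Jon"))     # J500 J500
```

Names with the same code, such as "Cary" and "Carey," are treated as a fuzzy match.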

In summary, the WHERE clause in SAS/IML works a little differently than the more-familiar version in the SAS DATA step. Both versions enable you to selectively include or exclude observations that satisfy one or more conditions. However, the SAS/IML WHERE clause is vectorized. You can specify a vector of conditions for operators, thus reproducing the functionality of the IN operator.

This article also demonstrates a few lesser-known string operators, such as "contains" (?), "not contains" (^?), "begins with" (=:), and "sounds like" (=*).


Visualize a weighted regression

What is weighted regression? How does it differ from ordinary (unweighted) regression? This article describes how to compute and score weighted regression models.


Technically, an "unweighted" regression should be called an "equally weighted" regression, since ordinary least squares (OLS) regression weights each observation equally. Similarly, an "unweighted mean" is really an equally weighted mean.

Recall that weights are not the same as frequencies. When talking about weights, it is often convenient to assume that the weights sum to unity. This article uses standardized weights, although you can specify any set of weights when you use a WEIGHT statement in a SAS procedure.

One way to look at a weighted regression is to assume that a weight is related to the variance of an observation. Namely, the i_th weight, w_i, indicates that the variance of the i_th value of the dependent variable is σ²/w_i, where σ² is a common variance. Notice that an observation that has a small weight (near zero) has a relatively large variance. Intuitively, the observed response is not known with much precision; a weighted analysis makes that observation less influential.


For example, the following SAS data set defines (x,y) values and weights (w) for 11 observations. Observations whose X value is close to x=10 have relatively large weights. Observations far from x=10 have small weights. The last observation (x=14) is assigned a weight of zero, which means that it will be completely excluded from the analysis.

data RegData;
input y x w;
datalines;
2.3  7.4  0.058 
3.0  7.6  0.073 
2.9  8.2  0.114 
4.8  9.0  0.144 
1.3 10.4  0.151 
3.6 11.7  0.119 
2.3 11.7  0.119 
4.6 11.8  0.114 
3.0 12.4  0.073 
5.4 12.9  0.035 
6.4 14.0  0 
;
run;

You can use PROC SGPLOT to visualize the observations and weights. In fact, PROC SGPLOT supports a REG statement that adds a regression line to the plot. The REG statement supports adding both weighted and unweighted regression lines:

proc sgplot data=RegData;
scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                colorresponse=w colormodel=TwoColorRamp;     /* color markers by weight */
reg x=x y=y / nomarkers legendlabel="Unweighted Regression"; /* usual OLS regression */
reg x=x y=y / weight=w nomarkers legendlabel="Weighted Regression"; /* weighted regression */
xaxis grid; yaxis grid;
run;
Weighted regression curves in PROC SGPLOT

Each marker is colored by its weight. Dark blue markers (observations for which 8 < x < 12) have relatively large weights. Light blue markers have small weights and do not affect the weighted regression model very much.

You can see the effect of the weights by comparing the weighted and unweighted regression lines. The unweighted regression line (in blue) is pulled upward by the observations near x=13 and x=14. These observations have small weights, so the weighted regression line (in red) is not pulled upwards.

Weighted regression by using PROC REG

If you want to compute parameter estimates and other statistics, you can use the REG procedure in SAS/STAT to run the weighted and unweighted regression models. The following statements run PROC REG twice: once without a WEIGHT statement and once with a WEIGHT statement:

proc reg data=RegData plots(only)=FitPlot;
   Unweighted: model y = x;
   ods select NObs ParameterEstimates FitPlot;
run;
proc reg data=RegData plots(only)=FitPlot;
   Weighted: model y = x;
   weight w;
   ods select NObs ParameterEstimates FitPlot;
run; quit;

The output (not shown) indicates that the unweighted regression model is Y = -0.23 + 0.36*X. In contrast, the weighted regression model is Y = 2.3 + 0.085*X. This confirms that the slope of the weighted regression line is smaller than the slope of the unweighted line.
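As an independent cross-check outside of SAS, you can reproduce both sets of estimates in a few lines of Python. This hypothetical sketch (numpy assumed) uses the same sqrt-of-weights scaling that the next section derives; it is an illustration, not part of the SAS workflow:

```python
import numpy as np

# the 11 observations from the RegData data set
y = np.array([2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4])
x = np.array([7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0])
w = np.array([.058, .073, .114, .144, .151, .119, .119, .114, .073, .035, 0])

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

def wls(X, y, w):
    """Weighted least squares: scale rows of X and y by sqrt(w), then solve."""
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return b

b_ols = wls(X, y, np.ones_like(y))  # unweighted = equally weighted
b_wls = wls(X, y, w)
# b_ols is close to (-0.23, 0.36); b_wls is close to (2.3, 0.085)
```

Because the last observation has weight zero, it contributes nothing to the weighted fit, exactly as in the SAS analysis.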

A weighted regression module in SAS/IML

You can read the SAS documentation to find the formulas that are used for a weighted OLS regression model. The formula for the parameter estimates is a weighted version of the normal equations: b = (X`WX)^(-1)(X`WY), where Y is the vector of observed responses, X is the design matrix, and W is the diagonal matrix of weights.

In SAS/IML it is more efficient to use elementwise multiplication than to multiply with a diagonal matrix. If you work through the matrix equations, you will discover that weighted regression is easily accomplished by using the normal equations on the matrices that result from multiplying the rows of Y and X by the square root of the weights. This is implemented in the following SAS/IML module:

proc iml;
/* weighted polynomial regression for one regressor. Y, X, and w are col vectors */
start PolyRegEst(Y, X, w, deg);
   Yw = sqrt(w)#Y;                  /* mult rows of Y by sqrt(w) */
   XDesign = j(nrow(X), deg+1);
   do j = 0 to deg;                 /* design matrix for polynomial regression */
      Xdesign[,j+1] = X##j;
   end;
   Xw = sqrt(w)#Xdesign;            /* mult rows of X by sqrt(w) */
   b = solve(Xw`*Xw, Xw`*Yw);       /* solve normal equations */
   return( b );
finish;

use RegData; read all var {Y X w}; close;    /* read data and weights */
/* loop to perform one-variable polynomial regression for deg=0, 1, and 2 */
do deg = 0 to 2;                    
   b = PolyRegEst(Y, X, w, deg);
   d = char(deg,1);                          /* '0', '1', or '2' */
   labl = "Estimates (deg=" + d + ")";
   print b[L=labl rowname=("b0":("b"+d))];   /* print estimates for each model */
end;
Parameter estimates for weighted regression models

The output shows the parameter estimates for three regression models: a "mean model" (degree 0), a linear model (degree 1), and a quadratic model (degree 2). Notice that the parameter estimates for the weighted linear regression are the same as estimates computed by PROC REG in the previous section.

Score the weighted regression models

The previous section describes how to use SAS/IML to compute parameter estimates of weighted regression models, and you can also use SAS/IML to score the regression models. The scoring does not require knowing the weight variable, only the parameter estimates. The following module uses Horner's method to evaluate a polynomial on a grid of points:

/* Score regression fit on column vector x by using Horner's method */
start PolyRegScore(x, coef);
   p = nrow(coef);
   y = j(nrow(x), 1, coef[p]);            /* initialize to coef[p] */
   do j = p-1 to 1 by -1; 
      y = y # x + coef[j];
   end;
   return( y );
finish;
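The Horner recurrence is compact in any language. Here is a minimal Python sketch, with coefficients in ascending order of degree to mirror the b0, b1, ... estimates above:

```python
def polyval_horner(coef, x):
    """Evaluate coef[0] + coef[1]*x + ... + coef[d]*x**d by Horner's method."""
    y = 0.0
    for c in reversed(coef):   # work from the highest-degree coefficient down
        y = y * x + c
    return y

# example: 1 + 2*x + 3*x**2 at x = 2 evaluates to 17
```

Horner's method uses only d multiplications and d additions for a degree-d polynomial, which is why it is the standard way to score polynomial models.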

You can compute predicted values for each model on a grid of points. You can then write the predicted values to a SAS data set and combine the predicted values and the original data. Lastly, you can use the SERIES statement in PROG SGPLOT to overlay the three regression models on the original data:

t = T( do(min(x), max(x), (max(x)-min(x))/25) ); /* uniform grid in range(X) */ 
Yhat = j(nrow(t), 3);
do d = 0 to 2;
   b = PolyRegEst(Y, X, w, d);                 /* weighted regression model of degree d */
   Yhat[,d+1] = PolyRegScore(t, b);            /* score model on grid */
end;
Z = t || Yhat;                                 /* write three predicted curves to data set */
create RegFit from Z[c={"t" "Pred0" "Pred1" "Pred2"}];
append from Z;
close RegFit;
quit;

data RegAll;                                   /* merge predicted curves with original data */
label w="Weight" Pred0="Weighted Mean" 
      Pred1="Weighted Linear Fit" Pred2="Weighted Quadratic Fit";
merge RegData RegFit;
run;

title "Weighted Regression Models";            /* overlay weighted regression curves and data */
proc sgplot data=RegAll;
scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                  colorresponse=w colormodel=TwoColorRamp;
series x=t  y=Pred0 / curvelabel;
series x=t  y=Pred1 / curvelabel;
series x=t  y=Pred2 / curvelabel;
xaxis grid; yaxis grid;
run;
Predicted values for weighted regression models

A visualization of the weighted regression models is shown to the left. The weighted linear fit is the same line that was shown in the earlier graph. The weighted mean and the weighted quadratic fit are the zero-degree and second-degree polynomial models, respectively. Of course, you could also create these curves in SAS by using PROC REG or by using the REG statement in PROC SGPLOT.


Weighted regression has many applications. The application featured in this article is to fit a model in which some response values are known with more precision than others. We saw that it is easy to create a weighted regression model in SAS by using PROC REG or PROC SGPLOT. It is only slightly harder to write a SAS/IML function to use matrix equations to "manually" compute the parameter estimates. No matter how you compute the model, observations that have relatively large weights are more influential in determining the parameter estimates than observations that have small weights.


Let PROC FREQ create graphs of your two-way tables

The recent releases of SAS 9.4 have featured major enhancements to the ODS statistical graphics procedures such as PROC SGPLOT. In fact, PROC SGPLOT (and the underlying Graph Template Language (GTL)) are so versatile and powerful that you might forget to consider whether you can create a graph automatically by using a SAS statistical procedure. For example, when you turn on ODS graphics (ODS GRAPHICS ON), SAS procedures create the following graphs automatically:

  1. Many SAS regression procedures (and PROC PLM) create effect plots.
  2. PROC SURVEYREG creates a hexagonal bin plot.
  3. PROC REG creates heat maps when a scatter plot would suffer from overplotting.
  4. PROC LOGISTIC creates odds-ratio plots.

Let PROC FREQ visualize your two-way tables

Recently a SAS customer asked how to use PROC SGPLOT to produce a stacked bar chart. I showed him how to do it, but I also mentioned that the same graph could be produced with less effort by using PROC FREQ.

PROC FREQ is a workhorse procedure that can create dozens of graphs. For example, PROC FREQ can create a mosaic plot and a clustered bar chart to visualize frequencies and relative frequencies in a two-way table. Next time you use PROC FREQ, add the PLOTS=ALL option to the TABLES statement and see what you get!

One of my favorite plots for two-way categorical data is the stacked bar chart. As I told the SAS customer, it is simple to create a stacked bar chart in PROC FREQ. For example, the following statements create a horizontal bar chart that orders the categories by frequency:

proc freq data=sashelp.Heart order=freq;
   tables weight_status*smoking_status / 
       plots=freqplot(twoway=stacked orient=horizontal);
run;
Stacked bar chart created by PROC FREQ

See the documentation for the PLOTS= option in the TABLES statement for a description of all the plots that PROC FREQ can create. PROC FREQ creates many plots that are associated with a particular analysis, such as the "deviation plot," which shows the relative deviations between the observed and expected counts when you request a chi-square analysis of a one-way table.

Do you need a highly customized graph?

The ODS graphics in SAS are designed to create many—but not all—of the visualizations that are relevant to an analysis. Some attributes of a graph (for example, the title and the legend placement) are determined by a stored template and can't be modified by using the procedure syntax. Advanced GTL gurus might want to learn how to edit the ODS templates. Less ambitious users might choose to use a statistical procedure to automatically create graphs during data exploration and modeling, but then use PROC SGPLOT to create the final graph for a report.

For example, the following call to PROC FREQ writes the two-way frequency counts to a SAS data set. From the data you can create graphs that are similar to the one that PROC FREQ creates, but you can change the order and colors of the bars, alter the placement of the legend, add text, and more. The following call to PROC SGPLOT shows one possibility. Click on the graph to see the full-size version.

proc freq data=sashelp.heart order=freq noprint;
   tables smoking_status*weight_status / out=FreqOut(where=(percent^=.));
run;

ods graphics / height=500px width=800px;
title "Counts of Weight Categories by Smoking Status";
proc sgplot data=FreqOut;
  hbarparm category=smoking_status response=count / group=weight_status  
      seglabel seglabelfitpolicy=none seglabelattrs=(weight=bold);
  keylegend / opaque across=1 position=bottomright location=inside;
  xaxis grid;
  yaxis labelpos=top;
run;
Customized stacked bar chart created by PROC SGPLOT, using the output from PROC FREQ


If you want to create a stacked bar chart or some other visualization of a two-way table, you might be tempted to immediately start using PROC SGPLOT. The purpose of this article is to remind you that SAS statistical procedures, including PROC FREQ, often create graphs as part of their output. Even if a statistical procedure cannot provide all the bells and whistles of PROC SGPLOT, it often is a convenient way to visualize your data during the preliminary stages of an analysis. If you need a highly customized graph for a final report, the SAS procedure can output the data for the graph. You can then use PROC SGPLOT or the GTL to create a customized graph.


Distances between observations in two groups

Last week I showed how to find the nearest neighbors for a set of d-dimensional points. A SAS user wrote to ask whether something similar could be done when you have two distinct groups of points and you want to find the elements in the second group that are closest to each element in the first group.

The answer is yes, and this problem occurs frequently in applications. Typically the first group contains the locations of randomly placed individuals and the second contains the locations of resources. For each individual, you want to find the closest resources. Examples include:

  • The first group contains the positions of houses. The second group contains the locations of stores. A retailer might want to identify the stores that are closest to customers, so that shipping costs are minimized.
  • The first group contains the locations of vehicle accidents on a highway. The second group contains the locations of hospitals that can treat the injured victims.
  • The first group contains the location of cell phone users. The second group contains the locations of cell towers.

This article solves the problem in a straightforward way: compute all the pairwise distances between the points in each group. Distances within groups are not needed, so they are not computed.


Compute the pairwise distances between points

I previously showed how to compute the pairwise distances between points in different sets. Let S be a set of n d-dimensional points and let R be another set of m points. The PairwiseDist function in SAS/IML (shown below) returns an n x m matrix, D, of distances such that D[i,j] is the distance from the i_th point in S to the j_th point in R. The PairwiseNearestNbr function is the same algorithm that was used to find nearest neighbors, so I will not re-explain how the function works. Suffice it to say that it returns two matrices: a matrix of row numbers and a matrix of distances.

proc iml;
/* Compute Euclidean distance between points in S and points in R.
   S is an n x d matrix, where each row is a point in d dimensions.
   R is an m x d matrix.
   The function returns the n x m matrix of distances, D, such that
   D[i,j] is the distance between S[i,] and R[j,].                  */
start PairwiseDist(S, R);
   if ncol(S)^=ncol(R) then return (.);       /* different dimensions */
   n = nrow(S);  m = nrow(R);
   idx = T(repeat(1:n, m));                   /* index matrix for S   */
   jdx = shape(repeat(1:m, n), n);            /* index matrix for R   */
   diff = S[idx,] - R[jdx,];
   return( shape( sqrt(diff[,##]), n ) );     /* sqrt(sum of squares) */
finish;

/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains the
                row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R          */
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);
   dist = j(n, k, .);
   D = PairwiseDist(S, R);      /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;
   end;
finish;
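The same computation translates directly to other vectorized languages. As a hypothetical cross-check, here is a numpy sketch of both functions; broadcasting replaces the index-matrix trick, and indices are 0-based rather than 1-based:

```python
import numpy as np

def pairwise_dist(S, R):
    """Return the n x m matrix D with D[i,j] = distance from S[i] to R[j]."""
    diff = S[:, None, :] - R[None, :, :]   # broadcast to an n x m x d array
    return np.sqrt((diff ** 2).sum(axis=2))

def pairwise_nearest_nbr(S, R, k=1):
    """Return (idx, dist): 0-based row numbers in R of the k nearest neighbors."""
    D = pairwise_dist(S, R)
    idx = np.argsort(D, axis=1)[:, :k]     # columns of the k smallest distances
    return idx, np.take_along_axis(D, idx, axis=1)
```

Applied to the square example in the next section, this sketch reproduces the same nearest and second-nearest vertices (shifted down by one because of 0-based indexing).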

An example of finding nearest points

To illustrate how you can use these functions, consider two sets S and R. The set R is the set of resource locations (stores, hospitals, towers, ...). For this example, R contains four points that are the vertices of a square of side length 2 centered at the origin. For each point in S, the following SAS/IML statements compute the closest vertex in R and the second-closest vertex in R:

/* R = vertices of a square */
R = {-1 -1,      /* Pt 1: lower left corner   */
      1 -1,      /* Pt 2: lower right corner  */
      1  1,      /* Pt 3: upper  right corner */
     -1  1};     /* Pt 4: upper left corner   */
S = {-0.3 -0.2,  /* points inside square */
      0.9  0.6,
     -0.5  0.8};
k = 2;          /* find nearest and second nearest points */
run PairwiseNearestNbr(idx, dist, S, R, k);
print idx[c={"Closest" "2nd Closest"} r=("S1":"S3")];
Points in a reference group that are closest to points in another group

The first column in the table shows the indices (row numbers) of the vertices that are nearest to each point of S. The second column shows the second-closest vertices. For example, Pt 1 (the lower left corner) is the nearest vertex to the first row of S, and Pt 4 (the upper left corner) is the second nearest. In a similar way, Pt 3 is the closest vertex to the second row of S, and Pt 4 is the closest vertex to the third row of S.

The PairwiseNearestNbr module also returns the dist matrix, which contains the corresponding distances. The i_th row of dist contains the distances from the i_th row of S to the two closest vertices.

Coloring points by the closest reference point

In a similar way, you can generate random points in the square and color each point in a scatter plot according to the nearest point in the reference set. The ODS GRAPHICS statement with ATTRPRIORITY=NONE forces the ODS style to cycle through colors and symbols.

ods graphics / attrpriority=NONE width=400px height=400px; /* cycle symbols and colors */
/* Generate 500 random points in square [-1,1] x [-1,1] */
call randseed(12345);
S = R // randfun({500 2}, "Uniform", -1, 1);
k = 2;
run PairwiseNearestNbr(idx, dist, S, R, k);
Corner = idx[,1];                            /* index of nearest vertex */
title "Points colored by nearest corner";
call scatter(S[,1], S[,2]) group=Corner grid={x y} procopt="aspect=1";
Random points colored according to the nearest vertex of a square

The scatter plot shows what you already knew: points in the first quadrant (green Xs) are closest to the vertex at (1,1). Points in the second quadrant (brown triangles) are closest to the vertex at (-1, 1), and so forth. Any points that are equidistant from two or more vertices (such as the origin) are assigned one of the colors arbitrarily. The points were sorted (not shown) so that the order in the legend agrees with the order of the points in R.

The call to the PairwiseNearestNbr subroutine used k=2, which requests the closest and the second-closest points. The second-closest indices are located in the second column of the idx matrix. If you change the title and set Corner = idx[,2], you can create a scatter plot in which each point is colored by the second-closest vertex. That coloration is shown in the scatter plot below. See if you can understand why each marker in the following scatter plot is colored the way it is. For example, for points in the first quadrant, the second-closest vertex is either the second vertex (1, -1) or the fourth vertex (-1, 1).

Random points colored according to the second-nearest vertex of a square

In summary, last week I showed how you can find the nearest neighbors among a single set of observations. This article solves a similar problem in which observations are split into two different groups. The functions in this article find the nearest observation in a "reference" or "resource" group for each observation in another group.

I'll conclude by mentioning that you can use these computations to compute the nearest distance between two groups of observations. Simply use k=1 and take the minimum distance that is computed by the PairwiseNearestNbr subroutine. This gives the smallest distance between any point in the first group and any point in the second group.
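In other words, the between-group distance is just the overall minimum of the pairwise distance matrix. A self-contained numpy sketch, using the square example (hypothetical data, not part of the SAS workflow):

```python
import numpy as np

R = np.array([[-1., -1.], [1., -1.], [1., 1.], [-1., 1.]])  # reference group
S = np.array([[-0.3, -0.2], [0.9, 0.6], [-0.5, 0.8]])       # other group

# n x m matrix of pairwise distances between the two groups
D = np.sqrt(((S[:, None, :] - R[None, :, :]) ** 2).sum(axis=2))
gap = D.min()   # smallest distance between any point in S and any point in R
```

For these points the minimum occurs between (0.9, 0.6) and the vertex (1, 1).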


Create an ogive in SAS

My son is taking an AP Statistics course in high school this year. AP Statistics is one of the fastest-growing AP courses, so I welcome the chance to see topics and techniques in the course. Last week I was pleased to see that they teach data exploration techniques, such as using an ogive (rhymes with "slow jive") to approximate the cumulative distribution of a set of univariate data. This article shows how you can create an ogive by using SAS procedures.


What is an ogive?

An ogive is also called a cumulative histogram. You can create an ogive from a histogram by accumulating the frequencies (or relative frequencies) in each histogram bin. The height of an ogive curve at x is found by summing the heights of the histogram bins to the left of x.

A histogram estimates the density of a distribution; the ogive estimates the cumulative distribution. Both are easy to construct by hand. Both are coarse estimates that depend on your choice of bin width and anchor position.

Ogives are not used much by professional statisticians because modern computers make it easy to compute and visualize the exact empirical cumulative distribution function (ECDF). However, if you are a student learning to analyze data by hand, an ogive is an easy way to approximate the ECDF from binned data. They are also important if you do not have access to the original data, but you have a histogram that appeared in a published report. (See "How to approximate a distribution from published quantiles.")

Create an ogive from a histogram

To demonstrate the construction of an ogive, let's consider the distribution of the MPG_CITY variable in the Sashelp.Cars data set. This variable contains the reported fuel efficiency (in miles per gallon) for 428 vehicle models. The following call to PROC UNIVARIATE in Base SAS uses the OUTHIST= option in the HISTOGRAM statement to create a data set that contains the frequencies and relative frequencies of each bin. By default, the frequencies are reported for the midpoints of the intervals. To create an ogive you need the endpoints of each bin, so use the ENDPOINTS option as follows:

proc univariate data=Sashelp.Cars;
var mpg_city;
histogram mpg_city / grid vscale=proportion ENDPOINTS OUTHIST=OutHist;
/* cdfplot mpg_city / vscale=proportion; */ /* optional: create an ECDF plot */
run;
Distribution of miles per gallon in 428 vehicles

The histogram shows that most vehicles get between 15 and 25 mpg in the city. The distribution is skewed to the right, with a few vehicles getting as much as 59 or 60 mpg. A few gas-guzzling vehicles get less than 15 mpg.

You can construct an ogive from the relative frequencies in the 11 histogram bins. The height of the ogive at x=10 (the leftmost endpoint in the histogram) is zero. The height at x=15 is the height of the first bar. The height at x=20 is the sum of the heights of the first two histogram bars, and so on.

Create an ogive from the output of PROC UNIVARIATE

Each row in the OutHist data set contains a left-hand endpoint and the relative frequency (height) of the bar. However, to construct an ogive, you need to associate the bar height with the right-hand endpoints. This is because at the left-hand endpoint none of the density for the bin has accumulated, and for the right-hand endpoint all of the density has accumulated.

Consequently, to construct an ogive from the OUTHIST= data set, you can do the following:

  • Associate zero with the leftmost endpoint of the bins.
  • Adjust the counts and proportions in the OutHist data so that they are associated with the right-hand endpoint of each bin. You can use the LAG function to do this.
  • Accumulate the relative frequencies in each bin to form the cumulative frequencies.
  • Add a new observation to the OutHist data that contains the rightmost endpoint of the bins.

The following SAS DATA step carries out these adjustments:

data Ogive;
set outhist end=EOF;
ogiveX = _MinPt_;        /* left endpoint of bin */
dx = dif(ogiveX);        /* compute bin width */
prop = lag(_OBSPCT_);    /* move relative frequency to RIGHT endpoint */
if _N_=1 then
   prop = 0;             /* replace missing value by 0 for first obs */
cumProp + prop/100;      /* accumulate proportions */
output;
if EOF then do;          /* append RIGHT endpoint of final bin */
   ogiveX = ogiveX + dx;
   cumProp = 1;
   output;
end;
drop dx _:;              /* drop variables that begin with underscore */
run;
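The lag-and-accumulate logic is compact enough to sketch in Python as well. The bin summary below is hypothetical (it is not the actual MPG_City histogram); numpy's cumsum plays the role of the running sum:

```python
import numpy as np

# hypothetical histogram summary: bin left endpoints (width 5) and percents
left = np.array([10, 15, 20, 25, 30, 35])
pct  = np.array([5, 40, 35, 12, 5, 3])           # relative frequencies, in percent

ogive_x = np.append(left, left[-1] + 5)          # append the rightmost endpoint
ogive_y = np.insert(np.cumsum(pct) / 100, 0, 0)  # height 0 at the left edge
# ogive_y rises monotonically from 0 to 1 across the bins
```

As in the DATA step, each cumulative proportion is associated with the right-hand endpoint of its bin, and the leftmost endpoint gets height zero.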

The Ogive data set contains all the information that you need to graph an ogive. The following call to PROC SGPLOT uses a VLINE statement, which treats the endpoints of the bins as discrete values. You could also use the SERIES statement, which treats the endpoints as a continuous variable, but might not put a tick mark at each bin endpoint.

title "Cumulative Distribution of Binned Values (Ogive)";
proc sgplot data=Ogive;
   vline OgiveX / response=cumProp markers;
   /* series x=_minpt_ y=cumProp / markers; */
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
An ogive created from a histogram. Each marker shows the cumulative proportion of the data values that lie below each histogram bin.

You can use the graph to estimate the percentiles of the data. For example:

  • The 20th percentile is approximately 17 because the curve appears to pass through the point (17, 0.20). In other words, about 20% of the vehicles get 17 mpg or less.
  • The 50th percentile is approximately 19 because the curve appears to pass through the point (19, 0.50).
  • The 90th percentile is approximately 27 because the curve appears to pass through the point (27, 0.90). Only 10% of the vehicles have a fuel efficiency greater than 27 mpg.

Compare an ogive to an empirical cumulative distribution

You might wonder how well the ogive approximates the empirical CDF. The following graph overlays the ogive and the ECDF for these data. You can see that the two curves agree closely at the ogive values, shown by the markers. However, there is some deviation because the ogive assumes a linear accumulation (a uniform distribution) of data values within each histogram bin. Nevertheless, this coarse piecewise linear curve that is based on binned data does a good job of showing the basic shape of the empirical cumulative distribution.

Comparison of an ogive and an empirical cumulative distribution