Visualize missing data in SAS

You can visualize missing data. It sounds like an oxymoron, but it is true.

How can you draw graphs of something that is missing? In a previous article, I showed how you can use PROC MI in SAS/STAT software to create a table that shows patterns of missing data in SAS. In addition to tables, you can use graphics to visualize patterns of missing data.

Counts of observations that contain missing values

As shown in the previous post, it is useful to be able to count missing values across rows as well as down columns. These row and column operations are easily accomplished in a matrix language such as SAS/IML. For example, the following SAS/IML statements read data from the Sashelp.Heart data set. A single call to the COUNTMISS function counts the number of missing values in each row. The BAR subroutine creates a bar chart of the results:

proc iml;
varNames = {AgeAtStart Height Weight Diastolic 
            Systolic MRW Smoking Cholesterol};
use Sashelp.Heart;                         /* open data set */
read all var varNames into X;              /* create numeric data matrix, X */
close Sashelp.Heart;
 
title "Counts of Rows That Contain 0, 1, 2, or 3 Missing Values";
title2 "Sashelp.Heart Data";
Count = countmiss(X,"row");            /* count missing values for each obs */
call Bar(Count);                       /* create bar chart */
[Figure: Count of observations that contain missing values]

The bar chart clearly shows that most observations do not contain any missing values among the specified variables. A small percentage of observations contain one missing value. Even fewer contain two or three missing values.

Which observations contain missing values?

It can be useful to visualize the locations of observations that contain missing values. Are missing values spread uniformly at random throughout the data? Or do missing values appear in clumps, which might indicate a systematic issue with the data collection?

The previous section computed the Count variable, which is a vector that has the same number of elements as there are rows in the data. To visualize the locations (rows) of missing values, use the LOC function to find the rows that have at least one missing value and plot the row indices:

missRows = loc(Count > 0);                  /* which rows are missing? */
title "Location of Rows That Contain Missing Values";
call histogram( missRows ) scale="count"    /* plot distribution */
     rebin={125, 250}                       /* bin width = 250 */
     label="Row Number";                    /* label for X axis */
[Figure: Location of rows that contain missing values]

This histogram is very revealing. The bin width is 250, which means that each bar includes 250 observations. For the first 2000 observations, about 15 of every 250 observations contain a missing value. Then there is a series of about 500 observations that do not contain any missing values. For the remaining observations, only about three of every 250 observations contain missing values. It appears that the prevalence of missing values changed after the first 2000 observations. Perhaps the patients in this data set were entered according to the date on which they entered the program. Perhaps after the first 2000 patients were recruited, there was a change in data collection that resulted in many fewer missing values.

A heat map of the missing values

The ultimate visualization of missing data is to use a heat map to plot the entire data matrix. You can use one color (such as white) to represent nonmissing elements and another color (such as black) to represent missing values.

For many data sets (such as Sashelp.Heart), missing observations represent a small percentage of all rows. For data like that, the heat map is primarily white. Therefore, to save memory and computing time, it makes sense to visualize only the rows that have missing values. In the Sashelp.Heart data (which has 5209 rows), only 170 rows have missing values. For each of the 170 rows, you can plot which variables are missing.

The following statements implement this visualization of missing data in SAS. The matrix Y contains only the rows of the data matrix for which there is at least one missing value. You can call the CMISS function in Base SAS to obtain a binary matrix with values 0/1. The HEATMAPDISC subroutine in SAS/IML enables you to create a heat map of a matrix that has discrete values.

ods graphics / width=400px height=600px;
Y = X[missRows,];              /* extract missing rows   */
call HeatmapDisc( cmiss(Y) )   /* CMISS returns 0/1 matrix */
     displayoutlines=0 
     colorramp={white black}
     xvalues=VarNames          /* variable names along bottom */
     yvalues=missRows          /* label rows by their row numbers */
     showlegend=0 title="Missing Value Pattern";
[Figure: Heat map that shows the missing value pattern]

The black line segments represent missing values. This heat map summarizes almost all of the information about missing values in the data. You can see which variables have no missing values, which have a few, and which have many. You can see that the MRW variable is always missing when the Weight variable is missing. You can see that the first 2250 observations have relatively many missing values, that there is a set of observations that are complete, and that the missing values occur less frequently for the last 2000 rows.

If you do not have SAS/IML software, you can still create a discrete heat map by using the Graph Template Language (GTL).
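
For example, the following sketch (my own, not from the linked article) converts the wide data to long form in a DATA step and then uses the HEATMAPPARM statement in GTL to display the 0/1 missingness matrix. The data set and template names are arbitrary, and the cell colors are assigned by the current ODS style:

data MissLong;                          /* long form: one row per (obs, variable) */
length VarName $32;
set Sashelp.Heart(keep=AgeAtStart Height Weight Diastolic 
                       Systolic MRW Smoking Cholesterol);
array v[*] _numeric_;                   /* the eight analysis variables */
Row = _N_;
do i = 1 to dim(v);
   VarName = vname(v[i]);               /* variable name for the X axis */
   IsMissing = missing(v[i]);           /* 1 = missing, 0 = nonmissing */
   output;
end;
keep Row VarName IsMissing;
run;
 
proc template;
define statgraph MissMap;
   begingraph;
   entrytitle "Missing Value Pattern";
   layout overlay / yaxisopts=(reverse=true);
      heatmapparm x=VarName y=Row colorgroup=IsMissing;
   endlayout;
   endgraph;
end;
run;
 
proc sgrender data=MissLong template=MissMap;
run;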

Be aware that this heat map visualization is practical only when the number of rows that have missing values is relatively small (about 1,000 rows or fewer). In general, a heat map should contain at least one vertical pixel for each data row that you want to visualize. For the Sashelp.Heart data, there are 170 rows that have missing data, and the heat map in this example has 600 vertical pixels. If you have thousands of rows and try to construct a heat map that has only hundreds of pixels, then multiple data rows must be mapped onto a single row of pixels. The result is unpredictable, and certainly not optimal.


Do you ever visualize missing data in SAS? What techniques do you use? Has visualization revealed something about your data that you hadn't known before? Leave a comment.


Examine patterns of missing data in SAS

Missing data can be informative. Sometimes missing values in one variable are related to missing values in another variable. Other times missing values in one variable are independent of missing values in other variables. As part of the exploratory phase of data analysis, you should investigate whether there are patterns in the missing data.

Counting is the simplest analysis of any data (missing or otherwise). For each variable, how many observations are missing? You can then proceed to more complex analyses. Do two variables share a similar missing value pattern? Can you predict that one variable will be missing if you know that another variable is missing?

This article shows a simple way to examine patterns of missing values in SAS. The example data is the Sashelp.Heart data set, which contains information about patients in the Framingham Heart Study. The meaning of most variables is evident from the variable's name. An exception is the MRW variable, which contains a patient's "Metropolitan Relative Weight." The MRW is the ratio of the patient's weight to an ideal weight, expressed as a percentage. Thus an MRW score of 100 means "ideal weight" whereas a score of 150 means "50% heavier than ideal." The MRW is similar to the more familiar body-mass index (BMI).

The number of missing values for each variable

I have previously shown how to use PROC FREQ to count the number of missing values for numeric and character variables. The technique uses a custom format to classify each value as "Missing" or "Not Missing."
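
A minimal sketch of that technique for two numeric variables follows (the format name MissFmt is my own; character variables would need a corresponding character format):

proc format;
value MissFmt .     = "Missing"
              other = "Not Missing";
run;
 
proc freq data=Sashelp.Heart;
   format Cholesterol Smoking MissFmt.;
   tables Cholesterol Smoking / missing nocum;  /* MISSING option includes missing values in the table */
run;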

Alternatively, for numeric variables you can use PROC MEANS to count the number of missing values. PROC MEANS creates a compact, easy-to-read table that summarizes the number of missing values for each numerical variable. The following statements use the N and NMISS options in the PROC MEANS statement to count the number of missing values in eight numerical variables in the Sashelp.Heart data set:

/* count missing values for numeric variables */
proc means data=Sashelp.Heart nolabels N NMISS;
var AgeAtStart Diastolic Systolic Height Weight MRW 
    Smoking Cholesterol;
run;

The NMISS column in the table shows the number of missing values for each variable. There are 5209 observations in the data set. Three variables have zero missing values, and another three have six missing values. The Smoking variable has 36 missing values whereas the Cholesterol variable has 152 missing values.

These univariate counts are helpful, but they do not tell you whether missing values for different variables are related. For example, there are six missing values for each of the Height, Weight, and MRW variables. How many patients contributed to those six missing values? Ten? Twelve? Perhaps there are only six patients, each with missing values for all three variables? If a patient has a missing Height, does that imply that Weight or MRW is missing also? To answer these questions you must look at the pattern of missing values.

Patterns of missing values

The MI procedure in SAS/STAT software is used for multiple imputation of missing values. PROC MI has an option to produce a table that summarizes the patterns of missing values among the observations. The following call to PROC MI uses the NIMPUTE=0 option to create the "Missing Data Patterns" table for the specified variables:

ods select MissPattern;
proc mi data=Sashelp.Heart nimpute=0;
var AgeAtStart Height Weight Diastolic 
    Systolic MRW Smoking Cholesterol;
run;
[Figure: Missing Data Patterns table from PROC MI]

The table reports the number of observations that have a common pattern of missing values. In addition to counts, the table reports mean values for each group. Because the table is very wide, I have truncated some of the mean values.

The first row counts the number of complete observations. There are 5039 observations that have no missing values.

Subsequent rows report the number of missing values for variables, pairs of variables, triplets of variables, and so forth. The variables are analyzed from right to left. Thus the second row shows that 124 observations have missing values for only the rightmost variable, which is Cholesterol. The third row shows that eight observations have missing values for only the Smoking variable. The fourth row shows that 28 observations have missing values for both Smoking and Cholesterol.

The table continues by analyzing the remaining variables that have missing values, which are Height, Weight, and MRW. You can see that MRW is never missing by itself, but that it is always missing when Weight is missing. Height is missing by itself in four cases, and is missing simultaneously with Weight in two cases.

Notice that each row of the table represents a disjoint set of observations. Consequently, we can easily answer the previous questions about how many patients contribute to the missing values in Height and Weight. There are 10 patients: four are missing only the Height measurement, four are missing only the Weight measurement (which forces MRW to be missing), and two are missing both Height and Weight.

This preliminary analysis has provided important information about the distribution of missing data in the Sashelp.Heart data set. Most patients (96.7%) have complete data. The most likely measurement to be missing is Cholesterol, followed by information about whether the patient smokes. You can see exactly how many patients are missing Height measurements, Weight measurements, or both. It is obvious that the MRW variable has a missing value if and only if the Weight variable is missing.

The "Missing Data Patterns" table from PROC MI provides a useful summary of missing values for each combination of variables. Examining patterns of missing values can lead to insight into the data collection process, and is also the first step prior to modeling missing data by using multiple imputation. In my next post, I will show how to use basic graphics to visualize patterns of missing data.


Head-tail versus head-head: A counterintuitive property of coin tosses

I saw an interesting mathematical result in Wired magazine. The original article was about mathematical research into prime numbers, but the article included the following tantalizing fact:

If Alice tosses a [fair] coin until she sees a head followed by a tail, and Bob tosses a coin until he sees two heads in a row, then on average, Alice will require four tosses while Bob will require six tosses (try this at home!), even though head-tail and head-head have an equal chance of appearing after two coin tosses.

Yes, I would classify that result as counterintuitive! When I read this paragraph, I knew I had to simulate these coin tosses in SAS.

Why does head-tail happen sooner (on average)?

Define the "game" for Alice as a sequence of coin tosses that terminates in a head followed by a tail. For Bob, the game is a sequence of coin tosses that terminates in two consecutive heads.

You can see evidence that the result is true by looking at the possible sequences of tosses that end the game for Alice versus Bob. First consider the sequences for Alice that terminate with head-tail (HT). The following six sequences terminate in four or fewer tosses:

  HT
 THT
 HHT
THHT
TTHT
HHHT

In contrast, among Bob's sequences that terminate with head-head (HH), only four require four or fewer tosses:

  HH
 THH
TTHH
HTHH

For Bob to "win" (that is, observe the sequence HH), he must first toss a head. On the next toss, if he obtains a head again, he wins. But if he observes a tail, he must start over: he has to toss until a head appears, and the process continues.

In contrast, for Alice to "win" (that is, observe the sequence HT), she must first toss a head. If the next toss shows a tail, she wins. But if it is a head, she doesn't start over. In fact, she is already halfway to winning on the subsequent toss.

This result is counterintuitive because a sequence is just as likely to terminate with HH as with HT. So Alice and Bob "win" equally often if they are observing flips of a single fair coin. However, the sequences for which Alice wins tend to be shorter than those for which Bob wins.
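
You can make this intuition quantitative with a short first-step (conditional expectation) argument, which I add here for completeness. Let $E_0$ be the expected number of remaining tosses from the start and $E_H$ the expected number after the most recent toss was a head. For Alice (HT), a head leaves her in the "head" state, so

$$E_H = 1 + \tfrac{1}{2}(0) + \tfrac{1}{2}E_H \implies E_H = 2, \qquad E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0 \implies E_0 = 4.$$

For Bob (HH), a tail after a head sends him back to the start, so

$$E_H = 1 + \tfrac{1}{2}(0) + \tfrac{1}{2}E_0, \qquad E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0 \implies E_H = 4, \quad E_0 = 6.$$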

A simulation of coin tosses

The following SAS DATA step simulates Alice and Bob tossing coins until they have each won 100,000 games. For each player, the program counts how many tosses are needed before the "winning" sequence occurs. A random Bernoulli number (0 or 1 with probability 0.5) simulates the coin toss.

data CoinFlip(keep= player toss);
call streaminit(123456789);
label Player = "Stopping Condition";
do i = 1 to 100000;
   do player = "HT", "HH";               /* first Alice tosses; then Bob */
      first = rand("Bernoulli", 0.5);    /* 0 is Tail; 1 is Head */
      toss = 1;  done = 0;
      do until (done);
         second = rand("Bernoulli", 0.5);
         toss = toss + 1;                /* how many tosses until done? */
         if player = "HT" then           /* if Alice, ...        */
            done = (first=1 & second=0); /* stop when HT appears */
         else                            /* if Bob, ...          */
            done = (first=1 & second=1); /* stop when HH appears */
         first = second;                 /* remember for next toss */
      end;
      output;
   end;
end;
run;
 
proc means data=CoinFlip MAXDEC=1 mean median Q3 P90 max;
   class player;
   var toss;
run;
[Figure: Sample statistics for the number of tosses required to end a simulated coin game]

The MEANS procedure displays statistics about the length of each game. The mean (expected) length of the game for Bob (HH) is 6 tosses. 75% of games were resolved in eight tosses or fewer, and 90% required 12 or fewer. In the 100,000 simulated games, the longest game that Bob played required 61 tosses before it terminated. (The maximum game length varies with the random number seed.)

In contrast, Alice resolved her 100,000 games by using about one-third fewer tosses. The mean (expected) length of the game for Alice (HT) is 4 tosses. 75% of the games required five or fewer tosses and 90% were completed in seven or fewer tosses. In the 100,000 simulated games, the longest game that Alice played required 22 tosses.

Distribution of tosses for two games with different winning criteria

You can create a comparative histogram to compare the distribution of the game lengths. The distribution of the length of Bob's games is shown in the upper histogram. It has a long tail.

In contrast, the distribution of Alice's game lengths is shown in the bottom of the panel. The distribution drops off much more quickly than for Bob's games.
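
A minimal sketch of the comparative histogram (my own; the downloadable program is more elaborate) uses PROC SGPANEL with the CoinFlip data from above:

proc sgpanel data=CoinFlip;
   panelby player / rows=2 layout=rowlattice;   /* one panel per stopping rule */
   histogram toss / binwidth=1;
run;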

It is also enlightening to see the empirical cumulative distribution functions (ECDFs) for the number of tosses for each player. You can download a SAS program that creates the comparative histogram and the ECDFs. For skeptics, the program also contains a simulation that demonstrates that a sequence terminates with HH as often as it terminates with HT.

The result is not deep, but it reminds us that human intuition gets confused by conditional probability. As with the classic Monty Hall problem, simulation can convince us that a result is true even when our intuition refuses to believe it.


Set attributes of markers in PROC SGPLOT by using ODS style elements

The SG procedures in SAS use aesthetically pleasing default colors, shapes, and styles, but sometimes it is necessary to override the default attributes. The MARKERATTRS= option enables you to override the default colors, symbols, and sizes of markers in scatter plots and other graphs. Similarly, the LINEATTRS= option enables you to override the default line colors, thickness, and pattern (solid, dashed, dotted, and so forth).

Many programmers specify hard-coded values such as MARKERATTRS=(COLOR=red SYMBOL=CircleFilled). However, sometimes it is important for colors and symbols to match the style elements in the current ODS style. For example, you might want to match the colors and symbols that appear in some other graph.

An example is shown below. The data set is the Sashelp.Cars data, which contains information about cars and trucks. The first graph is a scatter plot in which each marker is colored according to the value of the Origin variable, which has the values "Asia," "Europe," and "USA." The second scatter plot uses the WHERE statement to restrict the display to vehicles that were manufactured in the USA. For simplicity, only the most important statements are shown. You can download the complete SAS program that generates these graphs.

title "Vehicles: All Origins"; 
proc sgplot data=sashelp.cars;
  scatter x=wheelbase y=weight / markerattrs=(symbol=CircleFilled) group=origin;
run;
 
title "Vehicles: Origin=USA only";
proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=(symbol=CircleFilled);
run;

The graphs look the way they do because the SG procedures follow certain rules for displaying grouped versus non-grouped data. The first graph has three groups. The GROUP= option causes the SGPLOT procedure to assign the ODS element GraphData1 to the first group of markers ("Asia"). Similarly, the ODS element GraphData2 is assigned to the second group ("Europe") and the GraphData3 element is assigned to the third group ("USA"). (These names are described in the documentation for ODS style elements.) In the HTMLBlue style, which is the style used for these graphs, each marker is colored a shade of blue, red, or green according to its group membership.

The second plot shows only the "USA" vehicles. Whereas these markers were colored green in the first plot, they are colored blue in the second plot. This shade of blue comes from the GraphDataDefault element. The marker color is GraphDataDefault:ContrastColor, the marker size is GraphDataDefault:MarkerSize, and the marker symbol is GraphDataDefault:MarkerSymbol.

If you are displaying both graphs, you would certainly want the markers in the second graph to use the same shade of green as the "USA" vehicles in the first graph. This shade of green does not have a standard name, so you cannot specify a hard-coded color such as MARKERATTRS=(color=DarkGreen). Instead, use the MARKERATTRS= option to specify the marker attributes in the GraphData3 element, as follows:

proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=GraphData3;
run;

The MARKERATTRS= option tells the scatter plot to use markers that have the color, symbol, and size of the GraphData3 style element. For the default HTMLBlue style, that means dark green and open circles, as shown.

The second plot now looks similar to the "USA" vehicles in the first plot. Using the GraphData3 style changed the color, but the symbol is incorrect. To override an attribute of a style element, specify the attribute in parentheses AFTER the style element. For this example, specify the SYMBOL= option in parentheses, as follows:
proc sgplot data=sashelp.cars;
  where origin="USA";
  scatter x=wheelbase y=weight / markerattrs=GraphData3(symbol=CircleFilled);
run;

Voila! The task is complete. The second plot now has the same marker attributes as the first plot. (View the plots side by side.) If you were to change the ODS style from HTMLBlue to some other style, the colors would still match.

Notice that PROC SGPLOT does not support specifying the color by using the name of the style element. For example, the following is invalid in PROC SGPLOT:

scatter x=x y=y / markerattrs=(GraphData3:ContrastColor); /* NOT supported */

The Graph Template Language (GTL) supports specifying style elements using the "colon notation," but that syntax is not supported in the SG procedures.
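
For completeness, here is a sketch of the GTL colon notation (my own example; the template name is arbitrary):

proc template;
define statgraph ColonDemo;
   begingraph;
   layout overlay;
      /* GTL accepts style references such as GraphData3:ContrastColor */
      scatterplot x=Wheelbase y=Weight /
         markerattrs=(color=GraphData3:ContrastColor symbol=CircleFilled);
   endlayout;
   endgraph;
end;
run;
 
proc sgrender data=sashelp.cars(where=(Origin="USA")) template=ColonDemo;
run;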

The SAS documentation contains two sections that explain ODS style elements and how to use them in the SG procedures.

As indicated above, the same ideas apply to line attributes, region attributes, and so forth. For example, in a SERIES statement you can use LINEATTRS=GraphFit2(thickness=3) to assign attributes of the GraphFit2 style and override the THICKNESS attribute.

Thanks to Sanjay Matange and Dan Heath for several discussions that improved my understanding of ODS style elements and attributes.


Generate points uniformly inside a d-dimensional ball

Last week I showed how to generate random points uniformly inside a 2-D circular region. That article showed that the distance of a point to the circle's center cannot be distributed uniformly. Instead, you should use the square root of a uniform variate to generate 2-D distances to the origin. This result generalizes to higher dimensions. If you want to simulate points uniformly in the d-dimensional ball, generate the radius to be proportional to the dth root of a uniform random variate.

Although it is possible to generalize the polar mapping and obtain a parameterization of the d-dimensional ball, there is an easier approach to simulate points in a ball. The basic idea is to simulate d independent standardized normal variables, project them radially onto the unit sphere, and then adjust their distance to the origin appropriately. You can find this algorithm in many textbooks (Devroye, 1986; Fishman, 1996), but Harman and Lacko (2010) summarize the process nicely. To paraphrase:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit d-sphere. Multiplying S by U^(1/d), where U has the uniform distribution on the unit interval (0,1), creates the uniform distribution in the unit d-dimensional ball.

It is easy to carry out this operation in SAS. In Base SAS, you can write a DATA step that uses arrays. The following program provides a SAS/IML solution. I used SAS/IML Studio to create a 3-D rotating scatter plot of the simulated cloud of points. You can download the SAS program that contains the DATA step and the IMLPlus program that creates the animation.

proc iml;
call randseed(12345);
d = 3;                               /* dimension = number of variables */
N = 1000;                            /* sample size = number of obs     */
radius = 2;                          /* radius of circle */
Y = randfun(N // d, "Normal");       /* Y ~ MVN(0, I(d)) */
u = randfun(N, "Uniform");           /* U ~ U(0,1)       */
r = radius * u##(1/d);               /* r proportional to d_th root of U */
X = r # Y / sqrt(Y[,##]);            /* Y[,##] is sum of squares for each row */
/* X contains N random uniformly distributed points in the d-ball */
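
If you prefer the DATA step, the following statements are a minimal sketch of the same algorithm for d = 3 (my own version; the downloadable program contains the author's DATA step):

data Ball3D(keep=x1-x3);
call streaminit(12345);
array x[3] x1-x3;                             /* d = 3 coordinates */
radius = 2;
do i = 1 to 1000;
   ss = 0;
   do j = 1 to dim(x);
      x[j] = rand("Normal");                  /* Y ~ MVN(0, I(d)) */
      ss = ss + x[j]**2;                      /* accumulate sum of squares */
   end;
   r = radius * rand("Uniform")**(1/dim(x));  /* r = radius * U^(1/d) */
   do j = 1 to dim(x);
      x[j] = r * x[j] / sqrt(ss);             /* project onto sphere, then scale */
   end;
   output;
end;
run;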

The WHERE clause in SAS/IML

In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements.

  • On the USE statement, the WHERE clause acts as a global filter. The WHERE clause applies to all subsequent READ statements that read from the open data set.
  • On the READ statement, the WHERE clause further filters the data. Because you can use multiple READ statements, you can easily create matrices that contain disjoint or overlapping subsets of the data. However, be aware that multiple READ statements might cause SAS to read the data multiple times, depending on the syntax that you specify.

An interesting fact about the WHERE clause in SAS/IML is that you can specify run-time expressions for the WHERE clause, which makes it a very powerful tool for data analysis.

The WHERE clause in the USE statement

The USE statement opens a SAS data set for read access. When you need only one filter, specify a WHERE clause on the USE statement. For example, suppose that you want a matrix that contains the age, weight, and height of all females in the Sashelp.Class data. The following program reads the female observations into the matrix X and prints the average age, weight, and height:

proc iml;
varNames = {"Age" "Weight" "Height"};
use Sashelp.Class where(sex='F');       /* restrict to females */
   read all var varNames into X;
close Sashelp.Class;
 
avg = mean(X);
print avg[L="Mean Values for Females" colname=varNames format=5.2];

The WHERE clause in the READ statement

You can also put the WHERE clause in the READ statement. This technique is useful if you intend to read the data several times. For example, the following program reads data for females into the X matrix and data for males into the Y matrix:

use Sashelp.Class;
   read all var varNames into X where(sex='F');
   read all var varNames into Y where(sex='M');
close Sashelp.Class;

If you use a WHERE clause on both the USE and READ statements, the SAS log includes the following note to inform you that the data filter combines both WHERE clauses by using a logical AND operator:

NOTE: WHERE clause has been augmented
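
For example, the following statements (a small example of my own) read only the females who are older than 12:

proc iml;
varNames = {"Age" "Weight" "Height"};
use Sashelp.Class where(sex='F');                 /* global filter */
   read all var varNames into X where(age > 12);  /* combined: sex='F' AND age>12 */
close Sashelp.Class;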

Expressions in the WHERE clause

Beginning with SAS/IML 13.1 (released with SAS 9.4m1), you can use expressions in WHERE clauses. This means that you can call the READ statement in a loop and read and analyze a different subset of the data during each iteration.

For example, suppose that you have several grouping variables and you want to conduct a BY-group analysis. You can use the UNIQUEBY technique to conduct a BY-group analysis with several variables. However, the UNIQUEBY technique requires that the data be sorted and fit in RAM. It also requires a bit of "bookkeeping" because you need to keep track of indices. If you don't mind the inefficiency of reading the data multiple times, a WHERE clause approach is conceptually easier to program.

As an example, suppose that you want to analyze the MPG_City variable in the Sashelp.Cars data set for each combination of the Origin and Type variables. To keep it simple, suppose that you want to compute the mean value of MPG_City for all combinations of the Origin and Type variables, excluding the observations for American-made vehicles. This analysis is simple by using PROC MEANS. (The output from PROC MEANS is not shown.)

proc means data=Sashelp.Cars Mean;
where Origin ^= "USA";
   class Origin Type;
   var MPG_City;
run;

In PROC IML, this computation requires looping over the valid combinations of Origin and Type. To make the analysis simpler, the following call to PROC FREQ writes the valid combinations to a SAS data set:

proc freq data=Sashelp.Cars noprint;
where Origin ^= "USA";
tables Origin*Type / nocum norow nocol nopercent
                     out=FreqOut;  /* unique combinations of Origin and Type */
run;

In PROC IML, you can read the FreqOut data to obtain the unique combinations of the Origin and Type variables. You can iterate over these combinations, reading the Sashelp.Cars data multiple times. During each iteration, you can analyze one of the BY groups, as follows:

proc iml;
use FreqOut;
   read all var {Origin Type};           /* read unique levels of BY groups */
close FreqOut;
NumGroups = nrow(Origin);
 
use Sashelp.Cars where(Origin ^= "USA"); /* open data set for reading */
Stats = j(NumGroups, 2);                 /* allocate vector for results */
do i = 1 to NumGroups;                   /* for each BY group... */
   /* read unsorted data to obtain the i_th BY group */
   /* Notice the EXPRESSIONS in the WHERE clause! */
   read all var {MPG_City} where(origin=(Origin[i]) & type=(Type[i])); 
   Stats[i, 1] = countn(MPG_City);       /* number of nonmissing obs */
   Stats[i, 2] = mean(MPG_City);         /* analyze this BY group */
end;
close Sashelp.Cars;
 
print Origin Type Stats[colname={"N" "Mean"}];

The result of the analysis is similar to the output from PROC MEANS. Notice the use of expressions in the WHERE clause in the READ statement. The expression origin=(Origin[i]) is interpreted as follows:

  • The left side of the equal sign (origin) specifies the name of a variable in the open data set.
  • The right side of the equal sign must be enclosed in parentheses unless it is a literal constant.
  • The expression inside the parentheses can be any matrix computation that results in a scalar value, including calls to built-in or user-defined functions.
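
As a small illustration of the last point (my own example), the following statements compute a scalar and then use it inside a WHERE expression:

proc iml;
use Sashelp.Cars;
   read all var {MPG_City} into m;           /* read the full column once */
   avg = mean(m);                            /* scalar matrix computation */
   read all var {Model} into hiMPG where(MPG_City > (avg));  /* expression in parentheses */
close Sashelp.Cars;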

The example program reads the data set 10 times, once for each unique combination of Origin and Type. Although re-reading data is inefficient, there are three advantages: the data set does not need to be sorted, only one BY group at a time is ever in RAM, and the program statements are easy to write. By using this method, you do not have to keep track of sorting, indexing, or extracting the data. The WHERE clause in SAS/IML does the work for you.


Generate points uniformly inside a circular region in 2-D

It is easy to generate random points that are uniformly distributed inside a rectangle. You simply generate independent random uniform values for each coordinate. However, nonrectangular regions are more complicated. An instructive example is to simulate points uniformly inside the ball with a given radius. The two-dimensional case is to generate random points uniformly inside the planar region that contains points within a certain distance of the origin. Mathematicians call this region the two-dimensional ball or disk; others just say "inside a circle."

To generate data uniformly in any planar region, you can use the acceptance-rejection algorithm: You generate points uniformly in a bounding rectangle and then reject and discard any values that are outside the region of interest. This method always works, but it is inefficient in high dimensions (and for regions with small volumes) because most points that you generate get rejected. This article shows how to directly generate points with uniform density inside a circular region.
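
To make the comparison concrete, the following DATA step is a minimal sketch of the acceptance-rejection method for a disk of radius 2 (my own; it is not part of the main example):

data DiskReject(keep=x y);
call streaminit(12345);
radius = 2;
do until (n >= 1000);                       /* accept 1000 points */
   x = radius * (2*rand("Uniform") - 1);    /* uniform in bounding square */
   y = radius * (2*rand("Uniform") - 1);
   if x**2 + y**2 <= radius**2 then do;     /* keep only points inside the disk */
      n + 1;                                /* count accepted points */
      output;
   end;
end;
run;

In 2-D, a point is accepted with probability π/4 ≈ 0.785 (the ratio of the disk's area to the square's area), but the acceptance rate decays rapidly as the dimension increases.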

How NOT to simulate data in a disk

If you are familiar with polar coordinates, you might think that this problem is simple. The function
(r, θ) → (r cos θ, r sin θ)
parameterizes a circular region in terms of radii and angles. As a first attempt, you might try to simulate uniform values of a radius and angle, and then use the polar equations to convert to Euclidean coordinates, as shown in the following SAS DATA step:

/* This DATA step is WRONG; the points are not uniform */
data Ball1(drop=radius twopi);
call streaminit(12345);
radius = 2;                           /* use circle of radius 2 */
twopi = 2 * constant("PI");
do i = 1 to 1000;                     /* simulate 1000 points */
   theta = twopi * rand("uniform");   /* angle is uniform */
   r = radius* rand("uniform");       /* radius is uniform : THIS IS WRONG! */
   x = r*cos(theta);
   y = r*sin(theta);
   output;
end;
run;

Unfortunately, this simple approach is not correct. The points are uniformly distributed in the (r, θ) coordinates, but not in the (x, y) coordinates. The following scatter plot displays the (x,y) points and the circle of radius 2.

[Figure: Random points in a disk]

It is apparent from the image that there are more points near the origin than further away from the origin. The points are not uniformly distributed.

After seeing this image, it's not hard to realize the mistake in the simulation. When we say that points should be "uniformly distributed," we mean that the probability of generating a point in any finite region is proportional to the area of that region. However, the previous program does not consider area, only angles and radii.

The area of a circle with radius r is πr², which is proportional to the square of the radius. However, the program generates points inside a circle of radius r with a probability that is proportional to r. To be correct, the probability should be proportional to the square of r.
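
The fix follows from a standard inverse-CDF argument. For a disk of radius $R_0$, uniformity requires that the probability of landing within distance $r$ of the center equals a ratio of areas:

$$P(R \le r) = \frac{\pi r^2}{\pi R_0^2} = \left(\frac{r}{R_0}\right)^2,$$

so applying the inverse CDF to a uniform variate $U \sim U(0,1)$ gives $R = R_0 \sqrt{U}$.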

How to simulate data in a disk

You can adjust the probability by changing the distribution of the r variable. Instead of sampling r uniformly, sample r in proportion to the square root of a uniform variate (Devroye, 1986, p. 236). This distribution decreases the likelihood that points are generated near the origin and increases the probability that points are generated near the edge of the circular region. Only one statement in the DATA step needs to change. The following SAS DATA step is the correct way to simulate points uniformly in a disk:

/* correct way to generate points uniformly in a 2-D circular region (ball) */
data Ball(drop=radius twopi);
call streaminit(12345);
radius = 2;                                /* use circle of radius 2 */
twopi = 2 * constant("PI");
do i = 1 to 1000;                          /* simulate 1000 points */
   theta = twopi * rand("uniform");        /* angle is uniform */
   r = radius * sqrt( rand("uniform") );   /* radius proportional to sqrt(U), U~U(0,1) */
   x = r*cos(theta);
   y = r*sin(theta);
   output;
end;
run;
[Figure: Random points distributed uniformly in a disk]

The new graph of the simulation shows that the points are uniformly distributed within the circular region. You can use PROC KDE to verify that the density inside the circle is approximately constant and equal to 1/(4π), which is 1 divided by the area of the circular region.
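
A minimal PROC KDE call that performs this check (my own sketch) is:

ods graphics on;
proc kde data=Ball;
   bivar x y / plots=ContourScatter;   /* contours of the estimated density */
run;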

In a future post, I will show how to generalize this example and simulate random points uniformly within a d-dimensional ball for any d ≥ 2.


Save descriptive statistics for multiple variables in a SAS data set

Descriptive univariate statistics are the foundation of data analysis. Before you create a statistical model for new data, you should examine descriptive univariate statistics such as the mean, standard deviation, quantiles, and the number of nonmissing observations.

In SAS, there is an easy way to create a data set that contains the descriptive statistics for every numerical variable in your data: use the OUTTABLE= option in PROC UNIVARIATE. It doesn't matter if your data has 5 variables or 5,000 variables. That one option writes dozens of statistics for all numerical variables in the data!

PROC MEANS: A great way to display statistics on the screen

First, some basic background about viewing descriptive statistics. Many people learned that PROC MEANS is the standard way to display descriptive statistics on your computer screen. You specify the statistics that you want on the PROC MEANS statement, as follows:

/* can produce 38 descriptive statistics */ 
proc means data=sashelp.cars nolabels
           N Mean Std Min Q1 Median Q3 Max;   /* specify the statistics */
run;       /* omit VAR statement to analyze all numerical variables */

Notice that by omitting the VAR statement, the procedure analyzes all numerical variables. The output is an ODS table named "Summary."

That was easy. However, suppose that you do not care about seeing the statistics on your computer monitor, but you want them written to a SAS data set. Because you are analyzing many variables and want many statistics, it is somewhat awkward to use the OUTPUT statement in PROC MEANS. An easier way is to use ODS to write the statistics to a data set by doing the following:

  • Use the ODS EXCLUDE statement to suppress output to the monitor.
  • Specify the statistics that you want to see: the mean, standard deviation, median, and so forth.
  • Specify the STACKODSOUTPUT option on the PROC MEANS statement to tell the procedure to preserve the tabular form of the statistics table.
  • Use the ODS OUTPUT statement to write the "Summary" table to a SAS data set.

The solution now requires several lines of typing. It's not onerous, but it is a lot to remember:

ods exclude all;                  /* suppress display to open ODS destinations */
proc means data=sashelp.cars 
           N Mean Std Min Q1 MEDIAN Q3 Max  /* type every statistic you want */
           STACKODSOUTPUT;        /* preserve table form of output */
ods output Summary=MeansSummary;  /* write statistics to data set */
run;
ods exclude none;                 /* restore normal display of tables */

The data set MeansSummary contains the statistics for every numerical variable in the original data. This solution requires using ODS statements to direct the output to a data set while suppressing it from appearing on the screen. It is a nice trick in general, but there is an easier way for this particular task.

PROC UNIVARIATE and the OUTTABLE= option

I've been using SAS for 20 years, but somehow I never learned about the OUTTABLE= option in PROC UNIVARIATE until last week. While reading the documentation I learned that the OUTTABLE= option creates an output data set that contains 46 descriptive statistics for each numerical variable on the VAR statement.

If you omit the VAR statement, PROC UNIVARIATE analyzes all numerical variables. Therefore, by specifying the OUTTABLE= option, you can duplicate the PROC MEANS example without using ODS. As an extra bonus, you do not have to type the names of any statistics.

The following statement computes 46 statistics for each numerical variable in the input data set. The data set UniSummary contains the results. The NORMAL option tells PROC UNIVARIATE to include the Kolmogorov-Smirnov test for normality in the statistics that it creates:

proc univariate data=sashelp.cars outtable=UniSummary NORMAL noprint;
run;            /* omit VAR statement to analyze all numerical variables */
 
proc contents data=UniSummary short;
run;

By default, PROC UNIVARIATE produces about two pages of tables for each numerical variable, but the NOPRINT option suppresses the display of tables. The output from PROC CONTENTS shows the names of variables in the output data set. Each name represents a statistic, such as CSS (corrected sum of squares), CV (coefficient of variation), and GEOMEAN (geometric mean). The names begin and end with an underscore. The following statements print the same summary statistics that PROC MEANS produced; the output is not shown.

proc print data=UniSummary label;
   var _var_ _label_ _NObs_ _Mean_ _Std_ _Min_ _Max_;
run;

For both procedures, you can use the CLASS statement to obtain statistics for subgroups of the data. For example, you can include the statement CLASS origin to obtain summary statistics for each variable and grouped according to whether a vehicle was manufactured in Europe, Asia, or the USA.
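
For example, a sketch of the grouped analysis with PROC UNIVARIATE (the output data set name is my own):

proc univariate data=sashelp.cars outtable=UniSummaryByOrigin noprint;
   class origin;                /* one set of statistics per group */
run;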

It is worth noting that PROC UNIVARIATE has options (for example, WINSORIZED= and TRIMMED=) that compute statistics that are not saved to the OUTTABLE= data set. However, you can use ODS to capture those statistics.
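
For example, the following sketch captures the trimmed means for one variable by using the TrimmedMeans ODS table (the data set name TrimStats is my own):

ods output TrimmedMeans=TrimStats;   /* capture the trimmed-means table */
proc univariate data=sashelp.cars trimmed=0.1;
   var MPG_City;
run;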

In summary, this article shows how to create a data set for descriptive statistics by using PROC MEANS or PROC UNIVARIATE. You can use either procedure, but using the OUTTABLE= option in PROC UNIVARIATE saves you some typing. The OUTTABLE= data set also includes some robust statistics that are not available in PROC MEANS.


High school rankings of top NCAA wrestlers

Last weekend was the 2016 NCAA Division I wrestling tournament. In collegiate wrestling there are ten weight classes. The top eight wrestlers in each weight class are awarded the title "All-American" to acknowledge that they are the best wrestlers in the country.

I saw a blog post on the InterMat web site that lists each All-American and the wrestler's national ranking when he was a high school senior. The data are interesting, but I wanted a simple graph that visualized the data. I decided to use SAS to create graphs that show the high school ranking for each of this year's 80 All-American wrestlers.

I was also interested in the relationship between high school ranking and placement at the NCAA tournament. Were the top NCAA wrestlers already nationally recognized while still in high school? Or were there some "late bloomers" who perfected their skills after entering college?

Ranking wrestlers

Several web sites, magazines, and organizations try to rank the top 50 or 100 US high school wrestlers in each weight class. It can be a challenge to rank two individuals who have never wrestled each other. However, many of the top contenders in each weight class go to national tournaments, so there is often head-to-head data as well as data for common opponents.

An even more difficult challenge is attempting to rank the best wrestlers regardless of weight classes. How do you compare an undefeated heavyweight with an undefeated lightweight? Nevertheless, people do their best and you can find many internet lists that rank the "best pound-for-pound" wrestlers, boxers, MMA fighters, and so forth.

The high school rankings of the 2016 All-Americans

The InterMat article includes whether each All-American was ranked in the Top 100 as a high school senior. If so, it gives the wrestler's rank (presumably according to InterMat's own ranking system). If the wrestler was not nationally ranked, the article lists whether he was ranked in his weight class (a sort of honorable mention) or was unranked.

After importing the data into SAS, I used a custom format and PROC FREQ to tabulate the high school rankings against the wrestler's place in the NCAA tournament. You can download the data and the SAS program that generates the analyses in this article. The tabular results follow.
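
The tabulation uses code along these lines (a sketch with hypothetical data set and variable names Wrestlers, Place, and HSRank; the downloadable program has the details):

proc format;                      /* classify the high school rank */
value RankFmt  1 - 20   = "Top 20"
              21 - 100  = "Top 100"
              110       = "Ranked in weight class"
              120       = "Not ranked";
run;
 
proc freq data=Wrestlers;
   format HSRank RankFmt.;
   tables Place * HSRank / norow nocol nopercent;
run;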


Of the wrestlers who finished first at the NCAA tournament, eight had Top 20 status as high school seniors. The results were similar for second-place finishers. However, if you look at fourth place or lower, you can see that a surprisingly large number of All-Americans were unranked in high school. Still, with the exception of fifth-place winners, more than half of the wrestlers at each place (1–8) were ranked. Overall, 56 of the 80 All-Americans were nationally ranked in high school.

PROC FREQ can automatically create a mosaic plot that graphically visualizes the tabular results. Because there are exactly ten wrestlers for each place (1–8), the mosaic plot is actually a stacked bar chart for these data. (There is an alternative way to create a stacked bar chart in SAS.)
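
With the hypothetical names from the previous sketch, the mosaic plot requires only the PLOTS= option:

ods graphics on;
proc freq data=Wrestlers;
   format HSRank RankFmt.;
   tables Place * HSRank / norow nocol nopercent plots=mosaicplot;
run;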


For each place, the brown rectangle represents the proportion of place winners who were ranked in the Top 20 in high school. The green rectangle represents the proportion who were not in the Top 20, but were in the Top 100. The pink rectangle shows wrestlers who were ranked in their weight classes. The blue rectangles show All-Americans who were unranked as high school seniors. Those formerly unranked wrestlers are the "late bloomers" who improved markedly and became a top college wrestler.

Association between NCAA place and high school rank

The previous graph shows that most wrestlers who placed first, second, or third were top-ranked high school wrestlers. The InterMat web site includes the exact high school ranking (1–100), so let's plot each wrestler's NCAA place against his high school ranking. To accommodate the wrestlers who were not ranked, I arbitrarily assign the rank 110 to the weight-class-ranked wrestlers. Instead of plotting the value 110 on a graph, I use the abbreviation "WC" for "weight class." I assign the rank 120 to the unranked wrestlers and label that value "NR" for "not ranked."


The scatter plot of place versus high school ranking is shown to the left, along with a loess smoother. To separate the artificial ranks from the real ranks, I create a broken axis on the graph. The graph indicates that the All-Americans who were very highly ranked in high school placed very well at the NCAA tournament. For example, 14 wrestlers were ranked in the Top 10 in high school. Of those, 10 wrestled in the finals for first or second place, and another four wrestled for third or fourth place.

The association between place and ranking is noticeable until about ranking 25. After that, the loess smoother levels off, which indicates no relationship between high school ranking and placement at the tournament.

I want to emphasize that this sample is not a random selection of collegiate wrestlers, so you cannot conclude that high school ranking predicts success in college wrestling. Because the sample includes only All-Americans, the graphs show relationships given that these men are All-American champions. It is a subtle but important distinction.

Feel free to download the SAS program that created these graphs. Although in general I am not a fan of broken axes, I think they are useful in this case because they make it clear that the ranks 1–100 are different from the ranks "WC" and "NR". See Sanjay Matange's blog for more conventional applications of broken axes.

This analysis sends a clear message to high school wrestlers who are not nationally ranked: With hard work you can still become a premier collegiate athlete. At the same time, it clearly supports another truism: Many of the best athletes in college were also high school stars.


Nonparametric regression for binary response data in SAS

My previous blog post shows how to use PROC LOGISTIC and spline effects to predict the probability that an NBA player scores from various locations on a court. The LOGISTIC procedure fits parametric models, which means that the procedure estimates parameters for every explanatory effect in the model. Spline bases enable you to fit complex models, but it is easy to generate many spline effects, which means that you need to be careful not to overfit the data.

In contrast, modern nonparametric models enable you to balance the complexity of a model with the goodness of fit, thus reducing the likelihood of overfitting the data. SAS provides several procedures that fit nonparametric regression models for a binary response variable. Options include:

  • Use variable selection techniques in PROC LOGISTIC or PROC HPGENSELECT to allow the data to select the effects that best model the data. Variable selection creates a hybrid analysis that has properties of nonparametric models while preserving the interpretability of parametric models.
  • Use the GAMPL procedure in SAS/STAT 14.1 (SAS 9.4m3) to fit the data. The GAMPL procedure uses penalized likelihood (PL) methods to fit generalized additive models (GAM).

Other choices in SAS/STAT software include the ADAPTIVEREG procedure, which combines splines with variable selection techniques, and the HPSPLIT procedure, which is a tree-based classification procedure. Both procedures were introduced in SAS/STAT 12.1.

Generalized additive models in SAS

Generalized additive models use spline effects to model nonlinear relationships in data. A smoothing penalty is applied to each spline term in an attempt to model nonlinear features without overfitting the data. For details and examples, you can read the GAMPL documentation or watch a video about PROC GAMPL.

The syntax for the GAMPL procedure is similar to the familiar syntax for PROC LOGISTIC or PROC GENMOD. You can specify spline effects and the distribution of the response variable. The following call to PROC GAMPL uses a two-dimensional thin-plate spline to model the probability of Stephen Curry scoring from various shooting locations. The data are from Robert Allison's blog post "How to graph NBA data with SAS." You can download the complete SAS program that produces the graphs in this post.

proc gampl data=Curry;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = Spline(X Y / maxdf=40) / dist=binary;
   id X Y Shot_Made;
   output out=GamPLOut;
run;

The OUTPUT statement saves the predicted probabilities to a data set. The option MAXDF=40 tells the procedure to consider up to 40 degrees of freedom for the spline effect and to choose the smoothing parameter that provides the best tradeoff between model complexity and goodness of fit. For the Stephen Curry data, the optimal smoothing parameter results in 14.7 degrees of freedom.

[Figure: GAMPL analysis of the Curry data]

You can use the graph template language (GTL) to create a contour plot of the predicted probabilities. The contour map is qualitatively similar to the probabilities that were predicted by the PROC LOGISTIC analysis in my previous post. There is an area of high probability near the basket at (0,0). The probabilities on the right side of the graph are lower than on the left. There is a "hot spot" on the left side of the graph, which corresponds to a high probability that Curry will score from that region.
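
A sketch of such a template follows (my own; it assumes that you have scored the model on a regular grid of (X, Y) locations and that the gridded predictions are in a data set GamPLGrid with the predicted probability in a variable named Pred; both names are hypothetical):

proc template;
define statgraph ContourMap;
   begingraph;
   entrytitle "Predicted Probability of Scoring";
   layout overlay;
      /* CONTOURPLOTPARM requires (x, y) values on a regular grid */
      contourplotparm x=X y=Y z=Pred / contourtype=gradient;
   endlayout;
   endgraph;
end;
run;
 
proc sgrender data=GamPLGrid template=ContourMap;
run;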

Verify: The fundamental principle of nonparametric analysis

I initially view the results of any nonparametric analyses with skepticism. I trust the mathematics behind the methods, but I need to be convinced that a qualitative feature in the predicted values is real and not merely an artifact of some complicated nonparametric witchcraft.

There are many statistical techniques that enable you to evaluate whether a model fits data well, but it is wise to perform a basic "sanity check" by using a different nonparametric procedure to analyze the same data. If the two analyses reveal the same qualitative features in the data, that is evidence that the features are truly present. Conversely, if two models produce different qualitative features, then I question whether either model is accurate. I call this sanity check the fundamental principle of nonparametric analysis: Trust, but verify.

Let's apply the fundamental principle to the NBA data by running PROC ADAPTIVEREG:

proc adaptivereg data=Curry plots;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = X Y / dist=binary;
   output out=AdaptiveOut p(ilink);
run;
[Figure: ADAPTIVEREG analysis of the Curry data]

The PROC ADAPTIVEREG analysis is shown to the left. The contour plot shows the same qualitative features that were apparent from the LOGISTIC and GAMPL analyses. Namely, the probability of scoring is high under the basket, low to the right, average up the middle, and high on the left. Seeing these features appear in several analyses gives me confidence that these features of the data are real. After verifying that the models are qualitatively similar, you can investigate which model is better, perhaps by splitting the data into subsets for model training, validation, and testing.

Summary

This article briefly introduced two nonparametric procedures in SAS that can analyze binary response variables and other response distributions. The two analyses produced qualitatively similar predictions on sample data. The fundamental principle of nonparametric analysis is a meta-theorem that says that you should verify the qualitative predictions of a nonparametric model. Reproducibility is a necessary (but not sufficient) condition to believe that a feature is real and not spurious. For this example, all analyses agree that Stephen Curry shoots better from one side of the court than from the other.
