Where did it come from? Adding the source of each observation to a SAS data set

Imagine the following scenario. You have many data sets from various sources, such as individual stores or hospitals. You use the SAS DATA step to concatenate the many data sets into a single large data set. You give the big data set to a colleague who will analyze it. Later that day, he comes back and says "One of these observations is very interesting. Can you tell me which of the original data sets this observation came from?"

There's a simple way to generate the answer to this question. When you use the SET statement to concatenate the data sources, use the INDSNAME= option on the SET statement. The INDSNAME= option was introduced in SAS 9.2, but it is not as well-known as it should be, considering how useful it is.

Recording the name of a data source

Let's create some example data. The following SAS macro code creates 10 SAS data sets that each have a variable called X. A subsequent DATA step uses the SET statement to vertically concatenate the 10 data sets into a single data set:

%macro CreateDS;
%do i = 0 %to 9;
   data A&i;  x = &i;  run;
%end;
%mend;
 
%CreateDS;     /* create data sets A0, A1, A2, ..., A9 */
 
data Combined;  /* concatenate into a single data set */
set A0-A9;
run;

The Combined data set has 10 observations, one from each of the original data sets. However, there is no variable that indicates which observation came from which of the original data sets.

Longtime SAS programmers are undoubtedly familiar with the IN= data set option. You can put the IN= option inside parentheses after each data set name. It provides a (temporary) flag variable that you can use during the DATA step. For example, the following DATA step creates a variable named SOURCE that contains the name of the source data set.

/* create SOURCE variable: Using the IN= option is long and tedious */
data Combined;
set A0(in = ds0)  
    A1(in = ds1)  A2(in = ds2)  A3(in = ds3)
    A4(in = ds4)  A5(in = ds5)  A6(in = ds6)
    A7(in = ds7)  A8(in = ds8)  A9(in = ds9);
if ds0 then source = "A0";
else if ds1 then source = "A1";
else if ds2 then source = "A2";
else if ds3 then source = "A3";
else if ds4 then source = "A4";
else if ds5 then source = "A5";
else if ds6 then source = "A6";
else if ds7 then source = "A7";
else if ds8 then source = "A8";
else if ds9 then source = "A9";
run;
 
proc print data=Combined; run;
[Table: the Combined data set, which contains the SOURCE variable]

An easier way: Use the INDSNAME= option

The IN= data set option gets the job done, but for this task it is messy and unwieldy. Fortunately, there is an easier way. You can use the INDSNAME= option on the SET statement to create a temporary variable that contains the name of the source data set. You can then use the SCAN function to extract the libref and data set name to permanent variables, as follows:

/* create SOURCE variable: Using the INDSNAME= option is quick and easy */
data Combined2;
set A0-A9 indsname = source;  /* the INDSNAME= option is on the SET statement */
libref = scan(source,1,'.');  /* extract the libref */
dsname = scan(source,2,'.');  /* extract the data set name */
run;
 
proc print data=Combined2; run;
[Table: the Combined2 data set, which contains the LIBREF and DSNAME variables]

Success! Now the source for each observation is clearly indicated.

For an additional example, see the SAS Sample "Obtain the name of data set being read with SET statement with INDSNAME= option." For more cool DATA step features, see the paper "New in SAS 9.2: It’s the Little Things That Count" (Olson, 2008).


Large matrices in SAS/IML 14.1

Last week, SAS released the 14.1 version of its analytics products, which are shipped as part of the third maintenance release of SAS 9.4. If you run SAS/IML programs from a 64-bit Windows PC, you might be interested to know that you can now create matrices with about 2^31 ≈ 2 billion elements, provided that your system has enough RAM. (On Linux operating systems, this feature has been available since SAS 9.3.)

A numerical matrix with 2 billion elements requires 16 GB of RAM. In terms of matrix dimensions, this corresponds to a square numerical matrix that has approximately 46,000 rows and columns. I've written a handy SAS/IML program to determine how much RAM is required to store a matrix of a given size.
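
That program is not reproduced here, but the arithmetic is simple because a numerical matrix requires 8 bytes per element. Here is a minimal sketch (the module name and the example sizes are illustrative, not taken from the linked program):

proc iml;
/* Sketch: gigabytes of RAM needed to store an nrow x ncol numerical matrix,
   assuming 8 bytes per element */
start MatrixGB(nrow, ncol);
   return( 8 # nrow # ncol / 2##30 );   /* bytes / 2^30 = gigabytes */
finish;
 
n  = {16000, 25000, 46000};             /* square matrix dimensions */
GB = MatrixGB(n, n);
print n[F=comma9.] GB[F=8.2];
quit;

For these sizes, the sketch reports roughly 1.9 GB, 4.7 GB, and 15.8 GB, respectively.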

If you are running 64-bit SAS on Windows, this article describes how to set an upper limit for the amount of memory that SAS can allocate for large matrices.

The MEMSIZE option

The amount of memory that SAS can allocate depends on the value of the MEMSIZE system option, which has a default value of 2 GB on Windows. Many SAS sites do not override the default value, which means that SAS cannot allocate more than 2 GB of system memory.

You can run PROC OPTIONS to display the current value of the MEMSIZE option.

proc options option=memsize value;
run;
Option Value Information For SAS Option MEMSIZE
    Value: 2147483648
    Scope: SAS Session
    How option value set: Config File
    Config file name:
            C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg

The value 2,147,483,648 is displayed in the SAS log. Unfortunately, the value is reported in bytes; this number corresponds to 2 GB. Unless you change the MEMSIZE option, you will not be able to allocate a square matrix with more than about 16,000 rows and columns. For example, unless SAS can allocate 5 GB or more of RAM, the following SAS/IML program will produce an error message:

proc iml; 
/* allocate 25,000 x 25,000 matrix, which requires 4.7 GB */
x = j(25000, 25000, 0);


ERROR: Unable to allocate sufficient memory.

You can use the MEMSIZE system option to permit SAS to allocate a greater amount of system memory. SAS does not grab this memory and hold onto it. Instead, the MEMSIZE option specifies a maximum value for dynamic allocations.

The MEMSIZE option only applies when you launch SAS, so if SAS is currently running, save your work and exit SAS before continuing.

Changing the command-line invocation for SAS

If you run SAS locally on your PC, you can add the -MEMSIZE command-line option to the shortcut that you use to invoke SAS. This example uses "12G" to permit SAS to allocate up to 12 GB of RAM, but you can use different numbers, such as 8G or 16G.

  1. Locate the "SAS 9.4" icon on your Desktop or the "SAS 9.4" item on the Start menu.
  2. Right-click on the shortcut and select Properties.
  3. A dialog box appears. Edit the Target field and insert -MEMSIZE 12G at the end of the current text, as shown in the image.
  [Image: the Properties dialog box with -MEMSIZE 12G appended to the Target field]
  4. Click OK.

Every time you use this shortcut to launch SAS, the SAS process can allocate up to 12 GB of RAM. You can also specify -MEMSIZE 0, which permits allocations up to 80% of the available RAM. Personally, I do not use -MEMSIZE 0 because it permits SAS to consume most of the system memory, which does not leave much for other applications. I rarely permit SAS to use more than 75% of my RAM.

After editing the shortcut, launch SAS and call PROC OPTIONS. This time you should see something like the following:

Option Value Information For SAS Option MEMSIZE
    Value: 12884901888
    Scope: SAS Session
    How option value set: SAS Session Startup Command Line

SAS configuration files

A drawback of the command-line approach is that it only applies to a SAS session that is launched from the shortcut that you modified. In particular, it does not apply to launching SAS by double-clicking on a .sas or .sas7bdat file.

An alternative is to create or edit a configuration file. The SAS documentation has long and complete instructions about how to edit the sasv9.cfg file that sets the system options for SAS when SAS is launched.

SAS 9 creates two default configuration files during installation. Both configuration files are named SASV9.CFG. I suggest that you edit the one in !SASHOME\SASFoundation\9.4, which on many installations is c:\program files\SASHome\SASFoundation\9.4. By default, that configuration file has a -CONFIG option that points to a language-specific configuration file. Put the -MEMSIZE option and any other system options after the -CONFIG option, as follows:

-config "C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg"
-RLANG
-MEMSIZE 12G

Notice that I also put the -RLANG option in this sasv9.cfg file. The -RLANG system option specifies that SAS/IML software can interface with the R language.

If you now double-click on a .sas file to launch SAS, PROC OPTIONS reports the following information:

Option Value Information For SAS Option MEMSIZE
    Value: 12884901888
    Scope: SAS Session
    How option value set: Config File
    Config file name:
            C:\Program Files\SASHome\SASFoundation\9.4\SASV9.CFG

If you add multiple system options to the configuration file, you might want to go back to the SAS 9.4 Properties dialog box (in the previous section) and edit the Target value to point to the configuration file that you just edited.

Remote SAS servers

If you connect to a remote SAS server and submit SAS/IML programs through SAS/IML Studio, SAS Enterprise Guide, or SAS Studio, a SAS administrator has probably provided a configuration file that specifies how much RAM can be allocated by your SAS process. If you need a larger limit, discuss the situation with your SAS administrator.

Final thoughts on big matrices

You can create SAS/IML matrices that have millions of rows and hundreds of columns. However, you need to recognize that many matrix computations scale cubically with the dimension of the matrix. For example, many computations on an n x n matrix require on the order of n^3 floating-point operations. Consequently, although you might be able to create extremely large matrices, computing with them can be very time consuming.

In short, allocating a large matrix is only the first step. The wise programmer will time a computation on a sequence of smaller problems as a way of estimating the time required to tackle The Big Problem.
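
For example, one simple approach (a sketch, not code from the original article) is to time the same O(n^3) operation on a sequence of increasing sizes; if doubling n multiplies the run time by roughly eight, you can extrapolate to larger sizes:

proc iml;
call randseed(1);
sizes = {1000, 2000, 4000};          /* a sequence of smaller problems */
t = j(nrow(sizes), 1, .);
do i = 1 to nrow(sizes);
   n = sizes[i];
   A = j(n, n);
   call randgen(A, "Normal");        /* fill an n x n matrix with random values */
   t0 = time();
   B = A * A`;                       /* an O(n^3) computation */
   t[i] = time() - t0;
end;
print sizes[L="n"] t[F=8.3 L="Seconds"];
quit;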


Odds ratio plots with a logarithmic scale in SAS

I recently read an argument by Andrew Wheeler for using a logarithmic axis for plotting odds ratios. I found his argument convincing. Accordingly, this blog post shows how to create an odds ratio plot in SAS where the ratio axis is displayed on a log scale. Thanks to Bob Derr who read and commented on an early draft of this article.

As the name implies, the odds ratio is a ratio of two odds. You can look up a detailed explanation, but essentially it is the odds of an event occurring in one group divided by the odds of it occurring in another group. The odds ratio is always positive, and an odds ratio of 1 means that the odds of the event occurring in the two groups is the same.

When plotting an odds ratio, the relevant fact is that it is a ratio. A ratio is not symmetric: reversing the comparison groups produces the reciprocal of the ratio. For example, suppose the odds ratio of a disease is 10 when comparing females to males. This means that the odds of getting the disease for females are 10 times greater than the odds for males. However, it is just as correct to say that the odds ratio is 0.1 when you reverse the groups and compare males to females.

On a linear scale, the distance between 0.1 and 1 appears much smaller than the distance between 1 and 10. However, on a log scale, the distance between 10^-1 and 10^0 (=1) is the same as the distance between 10^0 and 10^1; on a log10 scale, both distances are 1. It is only by using a log scale that you can visually compare the magnitudes of confidence intervals and standard errors in an odds ratio plot.

The following example is from a SAS Note about estimating odds ratios. In the following example, patients with one of two diagnoses (complicated or uncomplicated) are treated with one of three treatments (A, B, or C) and the result (cured or not cured) is observed:

data uti;
   input diagnosis : $13. treatment $ response $ count @@;
   datalines;
complicated    A  cured 78  complicated   A not 28
complicated    B  cured 101 complicated   B not 11
complicated    C  cured 68  complicated   C not 46
uncomplicated  A  cured 40  uncomplicated A not 5
uncomplicated  B  cured 54  uncomplicated B not 5
uncomplicated  C  cured 34  uncomplicated C not 6
;
proc logistic data=uti plots(only)=oddsratio;
   freq count;
   class diagnosis treatment / param=glm;
   model response(event="cured") = diagnosis treatment diagnosis*treatment ;
   oddsratio treatment;
   oddsratio diagnosis;
   ods output OddsRatiosWald= ORPlot;  /* save data for later use */
run;
[Figure: default odds ratio plot on a linear scale]

The default odds ratio plot is shown. Five estimates are less than 1 and four are greater than 1. Four confidence intervals intersect 1, which indicates ratios that are not significantly different from 1.

These intervals are not adjusted for multiple comparisons, so you really shouldn't compare their lengths, but many people use the length of a confidence interval to visualize uncertainty in an estimate, and comparisons are inevitable. From looking at the graph, a casual reader might think that the uncertainty for the third ratio (treatment B vs C at diagnosis=complicated) is the biggest. However, this initial impression ignores the fact that the confidence intervals squished into the interval (0,1] might span several orders of magnitude!

Plotting the odds ratios on a log scale automatically

Several SAS procedures enable you to specify a log scale by using the procedure syntax. For example, the LOGISTIC, GLIMMIX, and FREQ procedures support the LOGBASE=10 option on the PLOTS=ODDSRATIO option to generate the plot automatically, as follows:

proc logistic plots=oddsratio(logbase=10); /* specify log scale */
...
run;
[Figure: odds ratio plot with a log10 scale]

The new odds ratio plot (click to enlarge) displays exactly the same data, but uses a log scale. In the second graph you can see that the confidence interval for the third item is no longer the widest. This plot presents a more faithful visual description of the uncertainty associated with each estimate, regardless of whether the estimate is less than 1 or greater than 1.

Comparing estimates is also much improved. In the first plot, the sixth ratio (treatment B vs C at diagnosis=uncomplicated), which has the value 1.9, seems to be the second most extreme estimate. In the second plot, you can see that the first and last estimates are more extreme (further from 1) than the sixth estimate. For example, the last estimate is about 0.26, which is equivalent to the reciprocal ratio 1/0.26 ≈ 3.8, which is much greater than 1.9. That fact was not evident in the first plot.

Plotting the odds ratios on a log scale manually

If you compute the odds ratio and confidence limits in a DATA step or in a procedure that does not support odds ratio plots, you can use the SGPLOT procedure to create the odds ratio plot with a logarithmic axis. Use the SCATTER statement to plot the estimates, the XERRORLOWER= and XERRORUPPER= options to plot the confidence intervals, and the TYPE=LOG option on the XAXIS statement to change the scale of the axis. The following call to PROC SGPLOT plots the data in the ORPlot data set, which was created by the ODS OUTPUT statement during the first call to PROC LOGISTIC. The names of the data set variables are self-explanatory:

title "Odds Ratios with 95% Wald Confidence Limits";
proc sgplot data=ORPlot noautolegend;
   scatter y=Effect x=OddsRatioEst / xerrorlower=LowerCL xerrorupper=UpperCL
           markerattrs=(symbol=diamondfilled);
   refline 1 / axis=x;
   xaxis grid type=log label="Odds Ratio (log scale)";  /* specify log scale */
   yaxis grid display=(nolabel) discreteorder=data reverse;
run;

The graph is essentially the same as the one produced by PROC LOGISTIC and is not shown. This same technique can be used to create forest plots in SAS.

Create an odds ratio plot with a log scale? You decide!

Should you use a log-scale axis for an odds ratio plot? It depends on your target audience. I recommend it when you are presenting results to a mathematically sophisticated audience.

For other audiences, it is less clear whether the advantages of a log scale outweigh the disadvantages. Odds ratio plots are used heavily in medical research, such as to report the results of clinical trials. Are the average medical practitioner and healthcare administrator comfortable enough with log scales to justify using them?

What do you think? If you use odds ratio plots in your work, which version do you prefer? SAS software can easily produce both kinds of odds ratio plots, so the choice is yours.


Convert a vector to a string

Base SAS contains many functions for processing strings, and you can call these functions from within a SAS/IML program. However, sometimes a SAS/IML programmer needs to process a vector of strings. No problem! You can call most Base SAS functions with a vector of parameters.

I have previously written about how to convert a string into a character vector, but sometimes it is useful to do the reverse operation. Specifically, it is sometimes useful to create a single string from the elements in a SAS/IML character vector or matrix. The following SAS/IML module concatenates all of the elements in a SAS/IML matrix into a single string, stripping out blank characters at the beginning or end of each element. The module contains an optional argument that you can use to specify a delimiter between words:

proc iml;
/* Given a character matrix, this module returns a string with 
   the N elements separated by spaces or a user-specified delimiter. */
start MakeString(vv, delim=" ");
   v = rowvec(vv);
   if type(v)='N' then v = putn(v,"BEST6."); /* convert numbers to char */
   N = ncol(v);
   s = "";
   do i = 1 to N-1;
      s = s + strip(v[i]) + delim;
   end;
   return ( s + strip(v[N]) );
finish;
 
/* Test the module. You might want to use ODS LISTING for printing because 
   HTML does not display multiple blank characters */
varNames = "x1":"x5";
s1 = MakeString(varNames);
x = {"I" "think" "that" "I" "shall" "never" "see"};
s2 = MakeString(x, "/");
colors = {"Red" "Blue" "Green"};
s3 = MakeString(colors, ", ");
num = 1:5;
s4 = MakeString(num, ",");
print s1, s2, s3, s4;
[Output: the concatenated strings s1, s2, s3, and s4]

It is difficult to eliminate the DO loop in the module and truly vectorize this process. The problem is that the SAS/IML language uses fixed-length arrays, so every time you try to trim a vector or extract substrings, the resulting IML vector is re-padded with blanks!

However, if the strings all have the same prefix, as in the varNames example, then you can write some special code that uses the prefix information to eliminate the loop. The ROWCATC function concatenates the elements into a single string without any blanks. The TRANWRD function in Base SAS replaces each prefix "x" with " x", which adds a space between the variable names in this special case:

/* ROWCATC concatenates into a single string without blanks;
   Use TRANWRD to put blank before each "x" character */
s5 = strip(tranwrd(rowcatc(VarNames), "x", " x"));

I've used this "vector to string" function several times, most recently when I was reading a parameter estimates table that was produced by a SAS procedure. I needed to print a single string that contained the name of the variables in the model, and this module made it easy.


Wealth and winning in NC high school athletics

The Raleigh News & Observer published a front-page article about the effect of wealth and poverty on high school athletics in North Carolina. In particular, the article concluded that "high schools with a high percentage of poor students rarely win titles in the so-called country club sports—tennis, golf and swimming—and the number of sports in which the more affluent dominate is growing."

The article analyzed state championships from 2001 through spring 2013 and concluded that "the percentage of students receiving free or reduced-price meals, which is an indication of students living in need, is a predictor of high school athletic success." The authors presented data on the number of state titles won in 10 sports, including baseball, men's basketball, football, volleyball, and the "country-club sports" for men and women. In percentage terms, the published results were as follows:

[Table: percentage of state titles won, grouped by the percentage of students who receive free or reduced-price meals]

According to these statistics, almost 36% of state titles over 13 years were won by relatively affluent schools: those for which less than 20% of the students are eligible for free or reduced-price meals. Another 33% of titles were won by moderately affluent schools that had between 20% and 40% of "needy" students. In contrast, only 1% of titles were won by high schools for which 80% or more of the students are needy.

At first glance, these numbers seem to back up the authors' claims. But do they? The story leaves out an important fact: what percentage of NC high schools falls in each of those five categories? After all, if 36% of the NC schools are affluent and only 1% are impoverished, these numbers merely reflect the distribution of schools within the state.

The NC Department of Public Education provides data for the 2013-2014 school year on free and reduced meals at all NC public schools. I restricted the list to high schools and used SAS software to tabulate the percentage of public schools in each category. You can download the CSV file with the school-level data and the SAS program that I used to analyze the data. The following table summarizes the results.

[Table: percentage of NC public high schools in each category of student need]
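
The tabulation itself requires only a user-defined format and PROC FREQ. The following sketch shows the idea; the data set name (HighSchools) and the variable name (PctNeedy) are placeholders for illustration, not the names used in the downloadable program:

proc format;                       /* bin the percentage of needy students */
value NeedFmt  low -< 20 = "Less than 20%"
               20  -< 40 = "20% to 40%"
               40  -< 60 = "40% to 60%"
               60  -< 80 = "60% to 80%"
               80 -  high = "80% or more";
run;
 
proc freq data=HighSchools;        /* percent of high schools in each category */
   format PctNeedy NeedFmt.;
   tables PctNeedy / nocum;
run;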

In 2013, only 5% of public high schools in NC were "affluent," as determined by having 20% or fewer students qualify for free or reduced meals. In contrast, 8% of NC high schools had 80% or more of the students qualify.

The data indicate that the advantage of wealth (and disadvantage due to poverty) is even more pronounced than the article indicates. The affluent schools won 613% more state titles than might be predicted based on their relative numbers. The schools with the highest percentage of poor students won 87% fewer titles than their relative numbers would indicate.

In short, affluent schools win many more state titles in NC than their relative numbers might suggest, and the percentage of students receiving free or reduced-price meals is a highly significant factor in predicting state athletic titles in the 10 sports listed by the N&O.

Of course, the situation is more complex than merely counting the number of schools in each category. Larger schools have more students to draw from. Suburban schools might be located closer to training facilities that offer off-season clubs and clinics. The more affluent schools might be able to recruit and retain better coaches.

With more data and more statistics, you could build better statistical models, but I think the results are clear. For NC high school athletics, wealth and winning are highly correlated.


The relationship between toothlessness and income

My colleague Robert Allison finds the most interesting data sets to visualize! Yesterday he posted a visualization of toothless seniors in the US. More precisely, he created graphs that show the estimated prevalence of adults (65 years or older) who have had all their natural teeth extracted. The dental profession calls these people edentulous. According to the CDC, about 20% of seniors (more than 35 million Americans) are edentulous.

When I looked at his sorted bar chart, I noticed that the states that had the most toothless seniors seemed to be poorer states such as West Virginia, Kentucky, Tennessee, Mississippi, and Louisiana. In contrast, richer states such as Colorado, Connecticut, and Hawaii had a relatively small number of toothless seniors. I wondered whether there was a correlation between median income in a state and the number of edentulous individuals.

Rob always publishes his SAS programs, so I used his data and merged it with the state median household income (2-year-average medians) as reported by the US Census Bureau. Then I used the SGPLOT procedure to plot the incidence of toothlessness (with lower and upper confidence intervals) versus the median state incomes. I also added a loess curve as a visual smoother to the data, as follows:

title "All Teeth Extracted vs. Median Income";
proc sgplot data=all;
scatter x=income y=pct / yerrorlower=LCL yerrorupper=UCL datalabel=N;
loess x=income y=pct;
run;
[Figure: percentage of edentulous seniors versus median household income, with a loess curve and confidence intervals]

The resulting graph (click to enlarge) shows a strong linear correlation (ρ = -0.63) between the variables, and the loess curve indicates that the relationship is stronger for states in which the median income is less than $50,000. The confidence intervals indicate that most of the data is well approximated by the loess fit, but there are a few outliers.

Two states in the upper left corner of the graph (West Virginia and Kentucky) have rates of edentulism that are much higher than suggested by the model that uses only median household income. Several states—including Montana, Florida, and Hawaii—have a much lower incidence of tooth extraction. For easy identification of the states on the scatter plot, you can create a second scatter plot that does not contain the confidence limits and instead displays the state names as labels.
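
For example, the labeled plot can be created along the following lines (a sketch that assumes the state name is stored in a variable called State):

title "All Teeth Extracted vs. Median Income";
proc sgplot data=all noautolegend;
   scatter x=income y=pct / datalabel=State;   /* label each marker with the state name */
   loess x=income y=pct / nomarkers;
run;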

[Figure: the same scatter plot with state names displayed as data labels]

Like Rob, I always post the full SAS code that creates the analyses and graphs in my blog posts, so feel free to play with the data and create more visualizations.

And regardless of your income or state of residence, brush your teeth!


A new method to simulate the triangular distribution

[Figure: histogram of 10,000 random draws from a triangular distribution with c=0.25, overlaid with the density function]

The triangular distribution has applications in risk analysis and reliability analysis. It is also a useful theoretical tool because of its simplicity. Its density function is piecewise linear. The standardized distribution is defined on [0,1] and has one parameter, 0 ≤ c ≤ 1, which determines the peak of the density. The histogram at left shows 10,000 random draws from a triangular distribution with c=0.25, overlaid with the triangular density function.

Because the triangular distribution is so simple, you can write down an explicit formula for its cumulative and inverse cumulative distribution function. The density is piecewise linear, so the cumulative distribution function (CDF) is piecewise quadratic. The inverse CDF is expressed in terms of a piecewise function that contains square roots.

I assumed that everything about this simple distribution has been known for 100 years. Therefore I was surprised to discover a 2009 paper titled "A new method to simulate the triangular distribution" (Stein and Keblis).

In the short paper, the authors note that the triangular distribution can be expressed as a linear combination of two other densities: a linear increasing density on [0,1] and a linear decreasing density on [0,1]. They then note that the first density is the density for the random variable MIN(u, v), where u and v are independent uniform random variates. Similarly, the second density is the density for the random variable MAX(u, v). Consequently, they realized that you can simulate a random variate from the triangular distribution by generating u and v and setting x = (1-c)*MIN(u,v) + c*MAX(u,v). They call this the MINMAX method for generating x.
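
In the DATA step, the MINMAX formula is a one-liner inside a loop. The following sketch (my illustration, not code from the paper) generates 1,000 variates with c = 0.25:

data Tri;
call streaminit(123);
c = 0.25;
do i = 1 to 1000;
   u = rand("Uniform");
   v = rand("Uniform");
   x = (1-c)*min(u,v) + c*max(u,v);   /* the MINMAX formula */
   output;
end;
keep x;
run;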

The RAND function in the DATA step and the RANDGEN function in SAS/IML have built-in support for the triangular distribution. Nevertheless, it is fun to implement new algorithms, so let's compare the MINMAX algorithm by Stein and Keblis to the traditional simulation algorithm, which is the inverse CDF method.

You can use the DATA step to implement the MINMAX algorithm, but I will use SAS/IML because it is easier to compare the performance of algorithms. The following program computes one million random variates from the triangular distribution in three ways: by calling the RANDGEN subroutine, by using the SAS/IML language to implement the inverse CDF algorithm, and by implementing the MINMAX algorithm by Stein and Keblis.

proc iml;
call randseed(123456);
N = 1e6;
c = 0.25;
time = j(1,3,.);
 
/* Built-in method: Call RAND or RANDGEN directly */
t0 = time();
   x = j(N,1);
   call randgen(x, "Triangle", c);
time[1] = time()-t0;
 
/* Inverse CDF method */
t0 = time();
   u = j(N,1);
   call randgen(u, "Uniform");
   x = choose(u<=c, sqrt(c*u), 1 - sqrt((1-c)*(1-u)));
time[2] = time()-t0;
 
/* MINMAX method: Stein and Keblis (2009) */
t0 = time();
   u = j(N,2);
   call randgen(u, "Uniform");
   x = (1-c)*u[,><] + c*u[,<>];       /* (1-c)MIN + c*MAX */
time[3] = time()-t0;
 
print N[F=comma9.] time[c={"Rand" "InvCDF" "MinMax"} F=6.4 L=""];
[Table: times (in seconds) for the RANDGEN, inverse CDF, and MINMAX methods]

The table shows that you can simulate one million random variates from the triangular distribution in 0.02 seconds by calling the RANDGEN subroutine. If you implement the inverse CDF method "by hand," it takes about 0.06 seconds. If you implement the MINMAX algorithm of Stein and Keblis, that algorithm also takes about 0.06 seconds.

You can draw a few conclusions from this exercise:

  • The SAS/IML language is fast. This is why researchers often use SAS/IML for simulation studies. In my book Simulating Data with SAS I show how to implement simulation studies efficiently in SAS/IML. Even the most complicated simulation in the book completes in less than 30 seconds.
  • SAS/IML is versatile. It is easy to implement new algorithms in the language.
  • SAS/IML is compact. As a matrix language, you can implement an entire algorithm with a few SAS/IML statements.
  • SAS/IML is convenient. Not only was I able to implement two different simulation algorithms, but I was able to easily compare the performance of the algorithms.

I guess another lesson from this exercise is that there is always something new to learn...even about a simple distribution that has been studied for hundreds of years!


Create a density curve with shaded tails

A SAS programmer wanted to plot the normal distribution and highlight the area under curve that corresponds to the tails of the distribution. For example, the following plot shows the lower decile shaded in blue and the upper decile shaded in red.

[Figure: standard normal density with the lower decile shaded in blue and the upper decile shaded in red]

An easy way to do this in SAS is to use the BAND statement in PROC SGPLOT. The same technique enables you to shade the area under a curve between two values, such as the central 80% of a distribution or the domain of integration for an integral.

For a parametric distribution such as the standard normal distribution, you can use a DATA step to evaluate the probability density function (PDF) at evenly spaced X values. The plot of the PDF versus X is the density curve. To use the BAND statement, you need to compute two other variables. The left tail consists of the set of X and PDF values for which X is less than some cutoff. For example, the lower decile cutoff is -1.28155, which is the value x_low for which Φ(x_low) = 0.1, where Φ is the standard normal cumulative distribution function (CDF). The following DATA step uses the QUANTILE function to compute the quantile (inverse CDF) values. (To remind yourself what these probability functions do, see the article "Four essential functions for statistical programmers.")

data density;
low  = quantile("Normal", 0.10);   /* 10th percentile = lower decile */
high = quantile("Normal", 0.90);   /* 90th percentile = upper decile */
drop low high;
/* if step size is dx, location of quantile can be off by dx/2 */
do x = -4 to 4 by 0.01;            /* for x in [xMin, xMax] */
   pdf = pdf("Normal", x);         /* call PDF to get density */
   if x<=low  then up1 = pdf; else up1 = .;
   if x>=high then up2 = pdf; else up2 = .;
   output;
end;
run; 
 
title "Normal Density with Colored Deciles";
proc sgplot data=density;
   series x=x y=pdf / legendlabel="Normal Density";
   band x=x upper=up1 lower=0 / fillattrs=GraphData1 legendlabel="Lower Decile";
   band x=x upper=up2 lower=0 / fillattrs=GraphData2 legendlabel="Upper Decile";
run;

The density curve with shaded tails was shown previously. The LOWER= option in the BAND statement sets the lower part of the colored bands to 0.

One subtlety is worth mentioning. In the DATA step, a DO loop is used to evaluate the density at points separated by 0.01. That means that the shaded regions have some inaccuracy. For example, the left tail in the image appears at x = -1.28, rather than at the exact value x = -1.28155. If more accuracy is required, you can evaluate the PDF at the exact quantile value and then sort the data set prior to graphing.
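
For example, the following sketch (which reuses the variable names from the previous DATA step) appends the density values at the exact deciles and sorts by x so that the shaded bands end exactly at the quantiles:

data Exact;                             /* density at the exact deciles */
x = quantile("Normal", 0.10);  pdf = pdf("Normal", x);  up1 = pdf;  up2 = .;   output;
x = quantile("Normal", 0.90);  pdf = pdf("Normal", x);  up1 = .;    up2 = pdf; output;
run;
 
data density;                           /* append the exact points and reorder by x */
   set density Exact;
run;
proc sort data=density;  by x;  run;

After sorting, rerun the PROC SGPLOT step shown earlier to draw bands that end exactly at the deciles.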

Shading quantiles of a kernel density estimate

The same technique can be used to color the tails of a nonparametric curve, such as a kernel density estimate. For example, suppose you want to compute a kernel density estimate for the Systolic variable in the Sashelp.Heart data set. The variable records the systolic blood pressures of 5,209 patients. You can use the OUTKERNEL= option in the HISTOGRAM statement of PROC UNIVARIATE to write the kernel density estimate to a data set, as follows:

proc univariate data=Sashelp.Heart;
   var systolic;
   histogram systolic / kernel outkernel=KerDensity; 
   ods select histogram quantiles;
run;

The output from the procedure (not shown) shows that the 10th percentile of the systolic values is 112, and the 90th percentile is 168. The output data set (called KerDensity) contains six variables. The _VALUE_ variable contains the X locations for the density curve and the _DENSITY_ variable contains the Y locations. You can use the empirical quantiles to set the cutoff values and highlight the upper and lower tails as follows:

%let lowerX = 112;   /* empirical 10th percentile */
%let upperX = 168;   /* empirical 90th percentile */
data KerDensity;
   set KerDensity;
   if _VALUE_<&lowerX then up1 = _DENSITY_; else up1 = .;
   if _VALUE_>&upperX then up2 = _DENSITY_; else up2 = .;
run; 
 
title "Kernel Density with Colored Deciles";
proc sgplot data=KerDensity;
   series x=_VALUE_ y=_DENSITY_ / legendlabel="Density";
   band x=_VALUE_ upper=up1 lower=0 / fillattrs=GraphData1 legendlabel="Lower Decile";
   band x=_VALUE_ upper=up2 lower=0 / fillattrs=GraphData2 legendlabel="Upper Decile";
   xaxis label="Systolic Blood Pressure" minorcount=4 minor;
run;
[Figure: kernel density estimate of systolic blood pressure with the lower and upper deciles shaded]

Notice that except for the names of the variables, the PROC SGPLOT statements are the same for the parametric and nonparametric densities.

Although this example uses deciles to set the cutoff values, you could also use medically relevant values. For example, you might want to use red to indicate hypertension: a systolic blood pressure that is greater than 140. On the low side, you might want to highlight hypotensive patients who have a systolic blood pressure less than 90.


Visualizing the distribution of ACT scores

My son is in high school and plans to take the ACT, a standardized test to assess college aptitude and readiness. My wife asked, "What is a good score for the ACT?" I didn't know, but I did a quick internet search and discovered a tabulation of scores for the 1.66 million students who took the ACT in 2012. The data were presented for four "content areas," namely English, mathematics, reading, and science. A score is a number between 1 and 36. The four content areas are combined to create a composite score.

To answer my wife's question, I used PROC MEANS in SAS to compute the following table, which contains summary statistics for ACT scores for the four content areas and for the composite scores. You can see that an average score for the ACT is about 20 or 21. A score of 25 or more will land you in the top quartile (25%) of students. To be in the top decile (10%), you need a score in the high 20s, or even 30.

[Table: summary statistics for 2012 ACT scores by content area and for the composite score]

The data were in a report that included a graph that tried to show the distribution of scores. However, I did not find the graph to be very effective. I thought I'd use SAS to visualize the distribution of ACT scores in a form that might be clearer. You can download the data and SAS program that I used to create the plots.

For a first attempt, I used PROC SGPANEL to create a panel with four bar graphs, one for each subject area. The published graph aggregated scores into bins, so I did the same. You can apply a custom SAS format and use PROC FREQ to aggregate the scores. The following graph contains the same information as the published graph, but shows the distribution of scores more clearly by using a single bar chart for each subject area:

[Figure: panel of bar charts of binned ACT scores for the four content areas]

The panel displays four graphs. The first row shows the distribution of aggregated scores for the English content area. About 25% of student scores are in the range 20–23. Only 14% of students scored 28 or higher. The distribution of reading scores is similar.

The distribution of math scores (the graph in the second row) is more interesting. This seems to be a bimodal distribution. About 45% of students have scores less than 20. About one quarter of students score in the 24–27 range. It is curious that almost no one scores lower than 13 in math, although low scores exist in the other content areas.

The bar charts for reading and science occupy the third and fourth rows, respectively. The most obvious feature is that the distribution of science scores is narrow, with only 9% of students scoring 28 or better.
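
For reference, a panel of bar charts like the one above can be produced by PROC SGPANEL. The following is only a sketch: the data set (ActScores) and the variables (Subject, ScoreRange, Percent) are assumed names, not the ones in the downloadable program:

proc sgpanel data=ActScores;
   panelby Subject / columns=1 novarname;
   vbar ScoreRange / response=Percent;   /* one bar chart per content area */
   rowaxis label="Percent of Students";
run;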

An alternative visualization of ACT scores

The aggregation of scores into bins is not necessary. You can also plot the distribution of the raw scores, as shown below.

[Figure: panel of bar charts of the raw (unbinned) ACT score distributions]

I like this variation. The 36 bins show the exact distribution of the scores. You can easily see the bimodal nature of the math scores. You can see that the reading and English scores have a relatively large standard deviation, with many students scoring very high or very low, especially when compared to the science scores.

By drawing a histogram, you can also add "benchmark scores" or other reference values to the distributions. According to the ACT report (p. 3): "A benchmark score is the minimum score needed on an ACT subject-area test to indicate a 50% chance of obtaining a B or higher or about a 75% chance of obtaining a C or higher in the corresponding credit-bearing college courses, which include English Composition, Algebra, Social Science and Biology."

By adding the benchmark scores, you can visually estimate what proportion of the ACT students are likely to get at least a B in each subject area, thus making a connection between ACT scores and college performance. For English, about two thirds of students have at least a 50% chance of getting a B or higher. The benchmark score for English is relatively low compared to the median. In contrast, the benchmarks for math and reading are about the same as the median scores in those subject areas, indicating that about half of the students have a 50% chance of getting a B or higher. Apparently college-level Biology is difficult for students because the benchmark score for science (24) is relatively high. Only 30% of students that take the ACT have a science score greater than or equal to the benchmark.

Which visualization do you prefer? Or maybe you have a different idea that you'd like to try? Leave a comment.


How to Winsorize data in SAS

Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.

The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.

In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th ordered value. You then take the mean of the new n observations.

Winsorize data in SAS

In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.

Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:

%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix. 
   Input proportion of observations to Winsorize: prop < 0.5. 
   Ex:  y = Winsorize(x, 0.1) computes the two-sided 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      k = ceil(prop*n);    /* number of obs to trim from each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)]; 
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)]; 
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;
 
/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, use %let DSName = sashelp.class;, which is a small data set.) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:

/* Compute Winsorized mean, which is  mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];
[Table: Winsorized means for the numeric variables in Sashelp.Heart]

With this data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.

As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:

/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
 
proc print data=winMeans;
   var VarName Mean;
run;

Unsymmetric Winsorization

The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.

Some software enables you to "Winsorize" data in an unsymmetric manner. Specifically, you can specify quantiles α < 0.5 and β > 0.5 and the software will replace values x < x(α) with x(α) and values x > x(β) with x(β), where x(α) is the value of the αth quantile. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.
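
For example, here is a minimal SAS/IML sketch of that idea for a single numeric vector; the quantile levels and the simulated data are only for illustration:

proc iml;
call randseed(1);
z = randfun(1000, "Normal");          /* example data with no missing values */
 
alpha = 0.05;   beta = 0.90;
call qntl(q, z, alpha//beta);         /* q[1] = x(alpha), q[2] = x(beta) */
w = z;
w[loc(z < q[1])] = q[1];              /* replace small values with x(alpha) */
w[loc(z > q[2])] = q[2];              /* replace large values with x(beta) */
asymWinMean = mean(w);
print asymWinMean;
quit;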

However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.
