Compute the number of digits in an integer

The title of this blog post might seem strange, but I occasionally need to compute the number of digits in a number, usually because I am trying to stuff an integer value into a string. Each time, I have to derive the formula from scratch, so I am writing this article so that I and others can look up the formula next time we need it.

The problem: You have a positive integer, n, which has k digits. How can you efficiently find k as a function of n?

The solution is to use logarithms. Because n has k digits,
10^(k-1) ≤ n < 10^k, or
10^(k-1) < n+1 ≤ 10^k.

Applying the base-10 logarithm produces
k-1 < log10(n+1) ≤ k.

The CEIL function in Base SAS computes the smallest integer that is greater than or equal to its argument, so the inequality shows that
k = ceil( log10(n+1) ).

A canonical example is n=1234, which has k=4 digits. How can you find k from n? The following program computes k:

data _null_;
n = 1234;      /* positive integer */
k = ceil(log10(n+1)); 
put "The integer " n= "has " k= "digits.";
run;
The integer n=1234 has k=4 digits.
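The formula is not specific to SAS: any language with a logarithm and a ceiling function will do. Here is a quick Python check of the same computation (illustrative only; for very large integers, floating-point rounding of the logarithm can be an issue):

```python
import math

def num_digits(n, base=10):
    """Number of digits of the positive integer n when written in the given base."""
    return math.ceil(math.log(n + 1, base))

print(num_digits(1234))     # 4
print(num_digits(9))        # 1
print(num_digits(10))       # 2
```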

If you have a negative integer m ≤ -1 and the absolute value has k digits, then m requires a string with k+1 elements, because one element is needed for the negative sign.

Incidentally, there is nothing special about base 10. The proof generalizes to compute the number of digits required to represent a number in any base. For example, the number n=1234 written in base 2 requires k = ceil(log2(n+1)) = 11 digits. You can check that the 11-digit binary number 10011010010 equals the decimal value 1234.

You can also use these mathematical ideas to compute the next power of 2 (or power of 10) that is larger than a given number.
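For example, the smallest power of a base b that is strictly greater than n is b raised to the power ceil(logb(n+1)). A Python sketch of that idea (the function name is mine):

```python
import math

def next_power(n, base=2):
    """Smallest power of `base` that is strictly greater than the positive integer n."""
    return base ** math.ceil(math.log(n + 1, base))

print(next_power(1234))       # 2048
print(next_power(1234, 10))   # 10000
```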


Monitor convergence during simulation studies in SAS

Ugh! Your favorite regression procedure just printed a warning to the SAS log. Something is wrong, and your attempt to fit a model to the data has not succeeded. A typical message is "WARNING: The validity of the model fit is questionable," perhaps followed by some additional diagnostic messages about "quasi-separation" or "lack of convergence."

If your modeling toolkit includes procedures such as LOGISTIC, GENMOD, MIXED, NLIN, or PHREG, you might have experienced convergence problems. A small sample size or a misspecified model are among the reasons for lack of convergence. There are many papers that discuss how to handle convergence issues. Paul Allison (2008) wrote a paper on some reasons that a logistic model might fail to converge, including an explanation of quasi-complete separation. The documentation for the MIXED procedure includes a long list of potential reasons that a mixed model might fail to converge. Several of my friends from SAS Technical Support give advice on convergence in their SAS Global Forum paper (Kiernan, Tao, and Gibbs, 2012).

Although it can be frustrating to deal with convergence issues during a data analysis, lack of convergence during a simulation study can be maddening. Recall that the efficient way to implement a simulation study in SAS is to use BY-group processing to analyze thousands of independent samples. A simulation study is designed to run in batch mode without any human intervention. How, then, can the programmer deal with lack of convergence during a simulation?

Some SAS users (such as Chen and Dong, 2009) have suggested parsing the SAS log for notes and warning messages, but that approach is cumbersome. Furthermore, if you have turned off SAS notes during the simulation, then there are no notes in the log to parse!

The _STATUS_ variable in the OUTEST= data set

Fortunately, SAS procedures that perform nonlinear optimization provide diagnostic variables as part of their output. These variables are informally known as "status variables." You can monitor the status variables to determine the BY groups for which the optimization converged.

The easiest way to generate a status variable is to use the OUTEST= option to generate an output data set that contains parameter estimates. Not every procedure supports the OUTEST= option, but many do. Let's see how it works. In my article "Simulate many samples from a logistic regression model," I showed how to generate 100 samples of data that follow a logistic regression model. If you make the sample size very small (like N=20), PROC LOGISTIC will report convergence problems for some of the random samples, as follows:

%let N = 20;                                     /* N = sample size */
%let NumSamples = 100;                           /* number of samples */
/* Generate logistic data: See http://blogs.sas.com/content/iml/?p=11735 */
 
options nonotes;    /* turn off notes; use OUTEST= option */
proc logistic data=LogisticData noprint outest=PE;
   by SampleId;
   model y(Event='1') = x1 x2;
run;
options notes;
 
proc freq data=PE;
tables _STATUS_ / nocum;
run;

The output from PROC FREQ shows that 10% of the models did not converge. The OUTEST= data set has a variable named _STATUS_ that indicates whether the logistic regression algorithm converged. You can use the _STATUS_ variable to analyze only those parameter estimates from converged optimizations. For example, the following call to PROC MEANS produces summary statistics for the three parameter estimates in the model, but only for the converged models:

proc means data=PE nolabels;
where _STATUS_ = "0 Converged";    /* analyze estimates for converged models */
var Intercept x1 x2;
run;

The ConvergenceStatus table

Notice that the _STATUS_ variable does not contain details about why the algorithm failed to converge. For that information, you can use the ConvergenceStatus ODS table, which is produced by all SAS/STAT regression procedures that involve nonlinear optimization. To save the ODS table to an output data set, you cannot use the NOPRINT option. Instead, use the %ODSOff and %ODSOn macros to suppress ODS output. Use the ODS OUTPUT statement to write the ConvergenceStatus tables to a SAS data set, as follows:

/* define %ODSOff and %ODSOn macros */
%ODSOff
proc logistic data=LogisticData;
   by SampleId;
   model y(Event='1') = x1 x2;
   ods output ParameterEstimates=PE2 ConvergenceStatus=CS;
run;
%ODSOn
 
proc freq data=CS;
tables Status Reason / nocum;
run;

The ConvergenceStatus table has a numerical variable named Status, which has the value 0 if the algorithm converged. The character variable Reason contains a short description of the reason that the algorithm terminated. The results of the PROC FREQ call show that the logistic algorithm converged 90 times, failed eight times because of "complete separation," and failed twice because of "quasi-complete separation."

You can use the DATA step and the MERGE statement to merge the ConvergenceStatus information with the results of other ODS tables from the same analysis. For example, the following statements merge the convergence information with the ParameterEstimates table. PROC MEANS can be used to analyze the parameter estimates for the models that converged. The summary statistics are identical to the previous results and are not shown.

data All;
merge PE2 CS;
by SampleID;
run;
 
proc means data=All;
where Status = 0;
class Variable;
var Estimate;
run;

In conclusion, the ConvergenceStatus table provides information about the convergence for each BY group analysis. When used as part of a simulation study, you can use the ConvergenceStatus table to manage the analysis of the simulated data. You can count the number of samples that did not converge, you can tabulate the reasons for nonconvergence, and you can exclude nonconverged estimates from your Monte Carlo summaries.
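The bookkeeping pattern is worth internalizing because it applies in any language: join the convergence status to the estimates by the sample identifier, then filter before you summarize. A toy Python sketch with hypothetical data:

```python
# hypothetical per-sample estimates from a simulation study
estimates = [
    {"sample": 1, "beta": 0.52},
    {"sample": 2, "beta": 9.87},   # wild estimate from a nonconverged fit
    {"sample": 3, "beta": 0.47},
]
status = {1: 0, 2: 1, 3: 0}        # 0 = converged, nonzero = failed

# keep only estimates from converged fits before computing Monte Carlo summaries
good = [r["beta"] for r in estimates if status[r["sample"]] == 0]
mean_beta = sum(good) / len(good)
print(round(mean_beta, 3))         # 0.495
```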


Video: Ten tips for simulating data with SAS

One of my presentations at SAS Global Forum 2015 was titled "Ten Tips for Simulating Data with SAS". The paper was published in the conference proceedings several months ago, but I recently recorded a short video that gives an overview of the 10 tips:



If your browser does not support embedded video, you can go directly to the video on YouTube.

The tips are based on material in my book Simulating Data with SAS, which contains hundreds of tips and techniques for simulating data, and several tips have been published as articles in my blog.


She wants to be an airborne ranger

I wanna be an airborne ranger,
Live the life of guts and danger.*

If you are an 80's movie buff, you might remember the scene in The Breakfast Club where Bender, the juvenile delinquent played by Judd Nelson, distracts the principal by running through the school singing this song. Recently, the US Army Ranger School has been in the news because, for the first time, two women passed the rigorous training regime.

It is extremely difficult for anyone, regardless of gender, to pass the course. Only 94 of the 475 men who attempted this year's course passed, and only two of the 21 women graduated.

Those numbers imply that approximately 20% of males graduated, whereas only 10% of females graduated. However, the initial female group was so small that I wondered whether this difference was statistically significant. Do these data indicate that there is a higher probability of passing the ranger course if you are male?

It turns out that the difference is not statistically significant. You can use PROC FREQ in Base SAS to analyze the data, as follows:

data Rangers;
input Gender $ Finished $ Count;
datalines;
M  Yes  94
M  No  381
F  Yes   2
F  No   19
;
 
proc Freq data=Rangers;
   weight Count;
   tables Gender * Finished / nocol nopercent 
          chisq 
          or(cl=score)
          riskdiff(cl=score column=2 norisks);
run;

The first row of the crosstabulation shows the counts and percentages of women who failed or passed the ranger course. The second row presents the counts and percentages for men. The first column shows that about 90% of the women failed the course, as compared with 80% of the men.

The options on the TABLES statement request several statistical tests. The CHISQ option requests a chi-square test for association between gender and finishing the course. Because one cell has a count that is less than 5, it is best to look at Fisher's exact test.


The p-value is not small, so there is insufficient evidence to reject the hypothesis that finishing the course is independent of gender. The other statistical tests (not shown) show similar results. These data do not suggest that gender significantly affects the odds of failing or the risk of failing. The confidence interval for the odds ratio includes 1; the confidence interval for the risk difference includes 0.
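If you want to double-check the Fisher test outside of SAS, the two-sided p-value for a 2x2 table can be computed from the hypergeometric distribution. Here is a stdlib-only Python sketch (the function name and the floating-point tolerance are my own, not part of PROC FREQ):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables that are no more
    likely than the observed one."""
    n = a + b + c + d
    r1 = a + b            # first row total
    c1 = a + c            # first column total
    def prob(k):          # P(top-left cell = k) under the null hypothesis
        return comb(c1, k) * comb(n - c1, r1 - k) / comb(n, r1)
    p_obs = prob(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Ranger data: 94 of the men and 2 of the women passed
pval = fisher_exact_2x2(94, 381, 2, 19)
print(pval > 0.05)   # True: the difference is not significant at the 0.05 level
```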

To obtain statistical significance, a larger cadre of women would need to be in the study. With the US Army re-evaluating the role of women in combat positions, maybe that will occur in the near future.

For now, congratulations to Capt. Kristen Griest and 1st Lt. Shaye Haver, the hard-working women who made history by passing the ranger course, and to all the women who attempted it. The data indicate that the difference in pass rates between men and women is not statistically significant. I think that is quite an achievement.

* The second line of this cadence has many variations, including some that are not safe for work.


Correlations between groups of variables

Typically a correlation analysis reports the correlations between all pairs of variables, including the variables with themselves. The resulting correlation matrix is square, symmetric, and has 1s on the main diagonal. But suppose you are interested in only specific combinations of variables. Perhaps you want the pairwise correlations between one group that contains n1 variables and another group of n2 variables.

No worries. The CORR procedure in Base SAS supports the WITH statement, which enables you to compute and display a subset of the correlations between variables.

Correlations of certain variables WITH other variables

Often variables naturally fall into groups. For example, in the Sashelp.Cars data, the variables are related to four groups: price, power of the engine, fuel economy, and size of vehicle. In a medical study you might have certain variables related to physical characteristics that will not be affected by the study (age, height, gender) and other variables that might be affected (blood pressure, cholesterol, weight).

In many cases, you are interested in the correlations between the groups. For example, the following PROC CORR analysis uses the VAR statement to specify four variables in one group and the WITH statement to specify three variables in another group. The procedure computes the 12 correlations between the two sets of variables:

/* default handling of missing values is pairwise */
proc corr data=Sashelp.Heart noprob nosimple; 
var AgeCHDDiag Height Weight MRW;
with Diastolic Systolic Cholesterol;
run;

Several variables in the Sashelp.Heart data set contain missing values. The table shows the correlations (top of each row) and the number of nonmissing values (bottom of each row) that are used to compute each correlation. The default handling of missing values in PROC CORR is pairwise exclusion, which is why different cells in the table report different numbers of nonmissing values.

The same computation in SAS/IML

This article was motivated by a question that a SAS customer asked on a discussion forum. The question was "How can you do this computation in SAS/IML?" The customer noted that the CORR function in SAS/IML accepts only a single data matrix.

Nevertheless, you can use the CORR function for this computation because the pairwise correlations that you want are a submatrix of the full correlation matrix. That means that you can compute the full correlation matrix (with pairwise exclusion of missing values) and use array indices to extract the relevant submatrix, as follows:

proc iml;
varNames = {"AgeCHDDiag" "Height" "Weight" "MRW"};
withNames = {"Diastolic" "Systolic" "Cholesterol"};
names = varNames || withNames;
use Sashelp.Heart;        
read all var names into X;               /* read VAR and WITH variables */
close;
 
corr = corr(X, "pearson", "pairwise");   /* compute full correlation matrix */
vIdx = loc( element(names, varNames) );  /* columns of 1st set of variables */
wIdx = loc( element(names, withNames) ); /* columns of 2nd set of variables */
withCorr = corr[wIdx, vIdx];             /* extract submatrix */
print withcorr[r=withNames c=varNames];

Notice that the column indices for the VAR and WITH variables are computed by using the LOC-ELEMENT technique, although for this example the columns are simply 1:4 and 5:7. The correlations are the same as were produced by PROC CORR.

However, you might be concerned that we are doing more work than is necessary. The CORR function computes the entire 7x7 matrix of correlations, which requires computing 28 pairwise correlations, when all we really need is 12. The computation is even less efficient if the VAR and WITH lists are vastly different in size, such as one variable in the first list and six in the second.

To be more efficient, you might want to compute only the correlations that you really need. The following module computes only the pairwise correlations that you specify:

/* Compute correlations between groups of variables. Use pairwise exclusion 
   of missing values. Specify the data matrix X and indices for the 
   columns that correspond to the VAR and WITH variables. */
start CorrWith(X, varIdx, withIdx, method="Pearson");
   withCorr = j(ncol(withIdx), ncol(varIdx));
   do i = 1 to ncol(withIdx);
      w = withIdx[i];               /* column for WITH var */
      do j = 1 to ncol(varIdx);
         v = varIdx[j];             /* column for VAR var */
         withCorr[i,j] = corr(X[,v||w], method, "pairwise")[1,2];
      end;
   end;
   return(withCorr);
finish;
 
withCorr = CorrWith(X, vIdx, wIdx);
print withcorr[r=withNames c=varNames];

The output is identical to the previous output, and is not shown.
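Incidentally, the "compute only the pairs you need, with pairwise deletion" idea is not SAS-specific. Here is a stdlib-only Python sketch (the function names and the use of None for missing values are my own conventions):

```python
import math

def pearson_pairwise(u, v):
    """Pearson correlation of two columns, using only rows where both are nonmissing."""
    pairs = [(x, y) for x, y in zip(u, v) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def corr_with(data, var_names, with_names):
    """Cross-correlation matrix: one row per WITH variable, one column per VAR variable."""
    return [[pearson_pairwise(data[w], data[v]) for v in var_names]
            for w in with_names]
```

Only len(var_names) * len(with_names) correlations are computed, which mirrors the behavior of the CorrWith module.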


Create heat maps with PROC SGPLOT

When SAS 9.4m3 was released last month (including SAS/STAT and SAS/IML 14.1), I was happy to see that a HEATMAP statement had been added to the SGPLOT procedure. Although heat maps in the SAS/IML language have been available for several releases, you previously had to use the Graph Template Language (GTL) to create a customized heat map in Base SAS.

To illustrate using the HEATMAP statement, consider the following counts for patients in a medical study, classified by ordinal variables that represent the patients' weight and smoking status. The data and the techniques used to create the table are discussed in the article "How to order categories in a two-way table with PROC FREQ." You can download the complete SAS program that is used to create the table and graphs in this analysis.


The categories and counts are contained in a data set called FreqOut, which was created by the FREQ procedure. The data set is arranged in "list form" (or "long form") as follows:


A basic heat map of a frequency table

The following statements create a basic heat map of the frequencies for the 15 cells of the table:

proc sgplot data=FreqOut;
   heatmap x=Weight_Cat y=Smoking_Cat / freq=Count 
         discretex discretey
         colormodel=TwoColorRamp outline;
run;

On the HEATMAP statement, the DISCRETEX and DISCRETEY options are used because the underlying variables are numeric; a user-defined format is used to render the numbers as "Underweight," "Normal," "Overweight," and so forth. A two-color gradient ramp is good for showing counts. The default color ramp has three colors (blue, white, and red), which is good for visualizing deviations from a reference value.

The heat map makes it visually clear that most patients in the study are overweight non-smokers. Other categories with many patients include the overweight heavy-smokers, the normal-weight non-smokers, and the normal-weight heavy-smokers. Underweight patients (regardless of smoking status) are relatively rare.

Overlaying text on a heat map

Although this same heat map can be created in SAS/IML by using the HEATMAPCONT subroutine, using PROC SGPLOT makes it easy to overlay other plot types and change properties of the graph. For example, you might want to overlay the counts of each cell to create a color-coded table. You can easily add the TEXT statement to the previous SGPLOT statements. If you add the counts, you don't need to include the gradient color ramp. Furthermore, to make the heat map look more like the table, you can add the REVERSE option to the YAXIS statement. Lastly, the category labels make it clear that the variables are related to smoking and weight, so you can suppress the variable names. The following statements implement these revisions and create a new heat map:

proc sgplot data=FreqOut noautolegend;
   heatmap x=Weight_Cat y=Smoking_Cat / freq=Count 
         discretex discretey
         colormodel=TwoColorRamp outline;
   text x=Weight_Cat y=Smoking_Cat text=Count / textattrs=(size=16pt);
   yaxis display=(nolabel) reverse;
   xaxis display=(nolabel);
run;

I'm excited by the HEATMAP statement in PROC SGPLOT. I think it is a great addition to one of my favorite SAS procedures.


The drunkard's walk in 2-D

Last month I wrote about how to simulate a drunkard's walk in SAS for a drunkard who can move only left or right in one direction. A reader asked whether the problem could be generalized to two dimensions. Yes! This article shows how to simulate a 2-D drunkard's walk.

In fact, it is possible to simulate the drunkard's walk in d dimensions. The drunkard starts at the origin of the coordinate system. He chooses a random coordinate direction and takes a unit step in either the positive or negative direction for that coordinate. If the drunkard takes N steps, you can determine the probability that the drunkard is in a particular location.

The computational methods used to simulate the drunkard's walk in higher dimensions are similar to the 1-D walk, so see my previous article for the background information.

Visualizing a drunkard's walk

In two dimensions, you can use a series plot to visualize the path of the drunkard as he stumbles to the north, south, east, and west. The following PROC IML program uses the Table distribution to draw 10 random values from the set {1, 2}, which are the two coordinate directions. The SAMPLE function makes random draws from the set {-1, 1} and determines whether the drunkard steps forward or backward in the chosen coordinate direction:

proc iml;
call randseed(54321);
dim = 2;                   /* number of directions */
NumSteps = 10;             /* number of steps */
 
coord = j(NumSteps,1,.);   /* allocate vector for random direction */
call randgen(coord, "Table", j(1,dim,1)/dim); /* random directions */
incr = sample({-1 1}, 1 // NumSteps);         /* random step +/-1 */
step = T(1:NumSteps);
print step coord incr;

The value coord=1 means that the drunkard takes a step in the x direction. The value coord=2 means that the drunkard steps in the y direction. The table shows that the drunkard's first step is a backward step in the x direction (west). The next step is a negative step in the y direction (south), and so forth. The complete trajectory is west, south, south, east, west, north, north, east, south, and east.

These two random vectors can be combined to form an N x d matrix where each column is a coordinate direction, each row is a time step, and the values 1, -1, and 0 indicate that the drunkard moves forward, backward, or does not move, respectively. This matrix contains the same information as the previous table. You can use the SUB2NDX function to convert subscripts into indices, and therefore create the matrix, as follows:

x = j(NumSteps, dim, 0);      /* (N x d) array */
idx = sub2ndx(dimension(x), step||coord);
x[ idx ] = incr;
print x;

You can encapsulate the preceding statements into a module that implements the Drunkard's Walk algorithm. You can download the complete SAS program that generates the graphs and computations in this article.

To visualize the path, you can use the CUSUM function on each column of the x matrix to accumulate the steps. The accumulation creates a new matrix that contains the position of the drunkard after each step. The following series plot attempts to show the path of the drunkard, with locations labeled by time. For most random walks, there is considerable overplotting. The graph shows that the drunkard started at the origin and sequentially moved west, south, south, east, west, and so forth. The final position of the drunkard is at (1, -1). This path is typical in that the drunkard staggers near the origin. Although it is possible for the drunkard to always choose the same direction, it is not probable.


Visualizing the probability distribution of the final position

Just as in the one-dimensional case, you can use simulation to visualize the probability distribution of the final position of the drunkard. If the drunkard takes an even number of steps, he must end up an even distance away from the origin, where the distance is measured in the "Manhattan" or "taxicab" metric (the L1 metric). The probability is highest that the drunkard returns to the origin, followed by the points that are two units away, four units away, and so forth. The following SAS/IML statements call the DrunkardWalk function 10,000 times and plot the counts for each final position by using a heat map:

/* Assume DrunkardWalk function is defined (see link to program).
   Compute final positions of 10,000 walks */
NumSteps = 6;
NumWalks = 10000;
finalPos = j(NumWalks, dim);
do i = 1 to NumWalks;
   x = DrunkardWalk(dim, NumSteps);
   finalPos[i,] = x[+,];
end;
 
create walk from finalPos[c={"x" "y"}];   /* write to SAS data set */
append from finalPos;
close;
submit;
   proc freq data=walk noprint;           /* PROC FREQ computes counts */
   tables x*y / list sparse out=freqOut;
   run;
endsubmit;
/* read results back into SAS/IML vectors */
use freqOut;   read all var {x y count};   close;
 
/* reshape counts into rectangular matrix and visualize with heat map */
nX = range(x) + 1;
nY = range(y) + 1;
W = shape(count, nX, nY);
call HeatmapCont(W) xValues = unique(x) yValues=unique(y);


The heat map estimates the probability distribution of the final position of the drunkard after taking six steps. Notice that the drunkard must land on a spot that is an even number of units (in the Manhattan metric) from the origin. About 10% of the time the drunkard returned to the origin. About 7% of the time, the drunkard ended at (1,1), (-1,1), (-1,-1), or (1,-1). About 5% of the time the drunkard ended at (2,0), (0,2), (-2,0), or (0,-2). Positions that are four units away are less probable. Final positions that are six units away occurred during the simulation, but were so rare that they are barely visible in the heat map. For example, only 14 walks ended at the location (0, 6).
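If you want to check these percentages independently, the walk is easy to simulate in a few lines of Python (a sketch with my own helper function, not the DrunkardWalk module from the program):

```python
import random
from collections import Counter

def drunkard_walk_2d(num_steps, rng):
    """Final (x, y) position after num_steps unit steps in random coordinate directions."""
    pos = [0, 0]
    for _ in range(num_steps):
        axis = rng.randrange(2)           # choose the x or y direction
        pos[axis] += rng.choice((-1, 1))  # step forward or backward
    return tuple(pos)

rng = random.Random(54321)
finals = [drunkard_walk_2d(6, rng) for _ in range(10000)]

# after an even number of steps, the Manhattan distance |x| + |y| must be even
assert all((abs(x) + abs(y)) % 2 == 0 for x, y in finals)

counts = Counter(finals)
print(counts[(0, 0)] / len(finals))   # close to (comb(6,3)/2**6)**2, about 0.098
```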


The drunkard's walk in two or more dimensions is a fun simulation of a discrete distribution. SAS/IML provides the necessary functionality to simulate many walks and to analyze the distribution of the results.


Those tricky PERCENT formats

When using SAS to format a number as a percentage, there is a little trick that you need to remember: the width of the formatted value must include room for the decimal point, the percent sign, and the possibility of two parentheses that indicate negative values. The field width must include room for the parentheses even if the number that you are formatting is strictly positive!

The PERCENTw.d format

The PERCENTw.d format takes a proportion and displays it as a percentage. For example, 0.123 is displayed as 12.3%. Negative values are displayed by enclosing them in parentheses, so -0.123 is displayed as (12.3%). Thus for proportions in the range (-1, 1), you need to allow at least six spaces in addition to the decimal digits: one each for the decimal point and the percent sign, two for the integer digits, and two for the parentheses. This leads to the following heuristic:

For the PERCENTw.d format, the field width should be at least 6 more than the number of decimal places that you want to display.

In other words, for the PERCENTw.d format, a good rule of thumb is to set w = d + 6. Smaller field widths will either steal space from the decimal width or display the value as "*%", which means "unable to display this value by using the current field width."
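The character-counting argument can be written down explicitly. The following Python sketch (purely illustrative; it is not how SAS implements formats) tallies the characters that the PERCENTw.d format must reserve for a proportion in (-1, 1):

```python
def min_percent_width(d, int_digits=2):
    """Characters needed: 2 parentheses + 1 percent sign + 1 decimal point
    (only if d > 0) + the integer digits + d decimal digits."""
    return 2 + 1 + (1 if d > 0 else 0) + int_digits + d

# the w = d + 6 rule of thumb: "(12.3%)" is 7 characters wide
assert min_percent_width(1) == 7 == len("(12.3%)")
assert min_percent_width(2) == 8 == len("(12.34%)")
# with no decimal digits, the decimal point disappears, so a width of 5 suffices
assert min_percent_width(0) == 5 == len("(12%)")
```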

My favorite formats are PERCENT7.1 and PERCENT8.2. An exception to the above heuristic is that I sometimes use PERCENT5.0 for variables that appear along the axis of a graph; you can use 5 for the field width because there is no decimal point to display. The following SAS DATA step displays several positive and negative proportions in the PERCENT7.1 and PERCENT8.2 formats:

data Percent;
format x BEST8. 
       p1 PERCENT7.1
       p2 PERCENT8.2;
label x = "Raw value"
      p1 = "PERCENT7.1"
      p2 = "PERCENT8.2";
input x @@;
p1 = x; 
p2 = x; 
datalines;
0.7543  0.0123 -0.0123 -0.7543
;
 
proc print data=Percent noobs label; run;

If you have values that are greater than 1, you will need to increase the field width, since these will be displayed as 150% (1.5) or 2500% (25).

The PERCENTNw.d format

Not every SAS programmer knows about the PERCENTNw.d format. Notice the extra 'N' in the name, which stands for 'negative.' This format is similar to the PERCENTw.d format except that negative values are displayed by using a negative sign, rather than parentheses.

For many applications, I prefer the PERCENTNw.d format because of the way that it handles negative values. The heuristic for specifying the field width is the same as for the PERCENTw.d format. This might surprise you. Since the PERCENTNw.d format uses a negative sign instead of parentheses, shouldn't you be able to specify a smaller field width?

No. For ease of converting between the PERCENTw.d and PERCENTNw.d formats, the rules are the same. The documentation for the PERCENTNw.d format states that "the width of the output field must account for the minus sign (–), the percent sign (%), and a trailing blank [emphasis added], whether the number is negative or positive."

You can modify the previous DATA step to display the same raw values by using the PERCENTNw.d format. The results follow:


In summary, remember to pad the field width for these formats by including space for the decimal point, the percent sign, and the possible negative sign or parentheses. You must do this even if your values are positive.

This blog post was inspired by an exchange on the SAS-L discussion forum. If you are new to Base SAS programming, The "L" is a good place to lurk-and-learn or to ask questions. I also recommend the SAS Support Communities for learning SAS software and related topics such as statistics, data mining, and matrix computations.


Regression coefficient plots in SAS


Last week's post about odds ratio plots in SAS made me think about a similar plot that visualizes the parameter estimates for a regression analysis. The so-called regression coefficient plot is a scatter plot of the estimates for each effect in the model, with lines that indicate the widths of the 95% confidence intervals (or sometimes the standard errors) for the parameters. A sample regression coefficient plot is shown. Variables whose confidence intervals intersect the reference line at 0 are not significant. This article describes how to construct coefficient plots in SAS.

In contrast to the odds ratio plot, SAS procedures do not produce a coefficient plot automatically, but almost every SAS regression procedure has an option to produce a parameter estimates table that contains estimates, standard errors, and confidence intervals. By using the ODS OUTPUT statement and PROC SGPLOT, it is easy to turn this table into a regression coefficient plot.

A coefficient plot of continuous regressors

The simplest example is for ordinary least squares (OLS) regression on continuous variables. The following data is presented in the documentation for PROC REG. The data specifies "the percentage of raw material that responds in a reaction" according to five explanatory variables. The PROC REG statement runs a main-effects model and uses the CLB option on the MODEL statement to request confidence limits for the coefficients. The ODS OUTPUT statement writes the ParameterEstimates table to a SAS data set called ParamEst.

%let XVars = FeedRate Catalyst AgitRate Temperature Concentration;
data reaction;
input &XVars ReactionPercentage @@;
datalines;
10.0   1.0  100   140  6.0   37.5  10.0   1.0  120   180  3.0   28.5
10.0   2.0  100   180  3.0   40.4  10.0   2.0  120   140  6.0   48.2
15.0   1.0  100   180  6.0   50.7  15.0   1.0  120   140  3.0   28.9
15.0   2.0  100   140  3.0   43.5  15.0   2.0  120   180  6.0   64.5
12.5   1.5  110   160  4.5   39.0  12.5   1.5  110   160  4.5   40.3
12.5   1.5  110   160  4.5   38.7  12.5   1.5  110   160  4.5   39.7
;
ods graphics off;
proc reg data=reaction;
   model  ReactionPercentage = &XVars / CLB;
   ods output ParameterEstimates = ParamEst;
quit;

The data set contains four variables that are needed to create the coefficient plot. The Variable variable contains the names of the effects, the Estimate variable contains the parameter estimates, and the LowerCL and UpperCL variables contain the lower and upper limits, respectively, of the 95% confidence interval for each parameter. The following call to the SGPLOT procedure arranges the parameters on the Y axis in the order in which they are specified in the model. The estimates and confidence intervals are plotted on the X axis for each parameter. (An alternative ordering, which I recommend, is to sort the ParamEst data by the size of the estimates before you create the graph.)

/* OPTIONAL SORT: proc sort data=ParamEst;  by Estimate;  run; */
title "Parameter Estimates with 95% Confidence Limits";
proc sgplot data=ParamEst noautolegend;
   scatter y=Variable x=Estimate / xerrorlower=LowerCL xerrorupper=UpperCL
           markerattrs=(symbol=diamondfilled);
   refline 0 / axis=x;
   xaxis grid;
   yaxis grid display=(nolabel) discreteorder=data reverse;
run;

The graph (click to enlarge) is dominated by the large estimate and confidence interval for the Intercept parameter. Consequently, it is difficult to see the range of the confidence intervals for the small parameters. You have to look at the original table of estimates in order to conclude that the AgitRate variable is not significant (the interval contains 0) whereas the Temperature variable is highly significant.

This example demonstrates why these plots are not always useful. In many cases the varied magnitudes of the coefficients result in a coefficient plot that does not visualize the complete set of parameter estimates well. The analyst is forced to look at the table itself to supplement the insights from the coefficient plot.

Incidentally, if you want the half-width of the intervals to be one standard error, you can use a simple DATA step to create the upper and lower limits. For example, Lower = Estimate - StdErr and Upper = Estimate + StdErr.
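A minimal sketch of that DATA step follows. It assumes the ParamEst data set from the earlier PROC REG call, which contains the Estimate and StdErr variables; the Lower and Upper variable names are my own choice:

/* create limits that are one standard error from the estimate */
data ParamEstSE;
set ParamEst;
Lower = Estimate - StdErr;   /* one standard error below the estimate */
Upper = Estimate + StdErr;   /* one standard error above the estimate */
run;

You can then specify XERRORLOWER=Lower and XERRORUPPER=Upper on the SCATTER statement instead of the confidence limits.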

Coefficient plots for other models

You can also create a coefficient plot from the parameter estimates produced by other SAS regression procedures. However, be aware that the names of the variables in the ODS OUTPUT data set might differ from procedure to procedure. There might also be rows in the parameter estimates table that do not correspond to regression effects. For example, the following statements call the ROBUSTREG procedure to compute robust estimates for the same main-effects model:

proc robustreg data=reaction method=M;
   model  ReactionPercentage = &XVars;
   ods output ParameterEstimates = RobParamEst;
quit;

The format of the ParameterEstimates table in the ROBUSTREG procedure is slightly different from the REG version. Whereas REG uses Variable as the name of the column that identifies the parameters, the ROBUSTREG procedure uses Parameter. Consequently, always look at the names of the variables produced by the particular procedure that you are using. The names for the lower and upper confidence limits might also vary across procedures.
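One quick way to inspect those names is to run PROC CONTENTS on the output data set. (This is just a convenience check; the SHORT option prints only the variable names.)

/* list the variable names in the ODS OUTPUT data set */
proc contents data=RobParamEst short;
run;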

The ParameterEstimates table for the robust fit includes a "Scale" parameter, which is fit as part of the model. The ParameterEstimates table from PROC GENMOD also includes a row for a "Scale" parameter.

The following call to PROC SGPLOT uses a WHERE clause to omit the estimate of "Scale." I've also excluded the "Intercept" estimate to focus the visualization on the other estimates.

proc sgplot data=RobParamEst noautolegend;
   where Parameter not in ("Intercept" "Scale");
   scatter y=Parameter x=Estimate / xerrorlower=LowerCL xerrorupper=UpperCL
           markerattrs=(symbol=diamondfilled);
   refline 0 / axis=x;
   xaxis grid;
   yaxis grid display=(nolabel) discreteorder=data reverse
         offsetmin = 0.1 offsetmax = 0.1;
run;

The graph is shown at the top of this article. With the intercept estimate gone, you can now resolve the small intervals around the estimates for AgitRate and Temperature.

Using coefficient plots to compare models

One application of the coefficient plot is to provide a visual comparison of different regression models. You can merge the results from different models that use the same explanatory variables but different estimation methods. The following statements merge the results from the OLS and robust regressions:

data All;
set ParamEst(rename=(Variable=Parameter) in=Reg)
    RobParamEst(where=(Parameter^="Scale"));
if Reg then Type="OLS   "; else Type="Robust";
run;
 
proc sgplot data=All;
   where Parameter ^= "Intercept";
   scatter y=Parameter x=Estimate / xerrorlower=LowerCL xerrorupper=UpperCL
           markerattrs=(symbol=diamondfilled)
           group=Type groupdisplay=cluster; /* display side-by-side estimates */
   refline 0 / axis=x;
   xaxis grid min= -1;
   yaxis grid colorbands=odd display=(nolabel) discreteorder=data reverse;
run;

For this plot, the COLORBANDS= option is used to add alternating horizontal stripes to the plot. The estimates are grouped by using the Type variable, and the GROUPDISPLAY=CLUSTER option slightly offsets the estimates for each group. The plot makes it apparent that the robust estimates are about the same as the OLS estimates, but the robust confidence intervals are smaller than the corresponding OLS intervals.

Conclusions

The coefficient plot can be a useful tool for visualizing estimates in a regression model with many parameters. It is less useful for models that have widely varying parameter estimates. This article shows how to use the ODS OUTPUT statement and the SGPLOT procedure to create regression coefficient plots in SAS.

Coefficient plots seem to have found favor with social scientists.


Where did it come from? Adding the source of each observation to a SAS data set

Imagine the following scenario. You have many data sets from various sources, such as individual stores or hospitals. You use the SAS DATA step to concatenate the many data sets into a single large data set. You give the big data set to a colleague who will analyze it. Later that day, he comes back and says "One of these observations is very interesting. Can you tell me which of the original data sets this observation came from?"

There's a simple way to generate the answer this question. When you use the SET option to concatenate the data sources, use the INDSNAME= option on the SET statement. The INDSNAME= option was introduced back in the SAS 9.2 days, but it is not as well-known as it should be, considering how useful it is.

Recording the name of a data source

Let's create some example data. The following SAS macro code creates 10 SAS data sets that each have a variable called X. A subsequent DATA step uses the SET statement to vertically concatenate the 10 data sets into a single data set:

%macro CreateDS;
%do i = 0 %to 9;
   data A&i;  x = &i;  run;
%end;
%mend;
 
%CreateDS;     /* create data sets A0, A1, A2, ..., A9 */
 
data Combined;  /* concatenate into a single data set */
set A0-A9;
run;

The Combined data set has 10 observations, one from each of the original data sets. However, there is no variable that indicates which observation came from which of the original data sets.

Longtime SAS programmers are undoubtedly familiar with the IN= data set option. You can put the IN= option inside parentheses after each data set name. It provides a (temporary) flag variable that you can use during the DATA step. For example, the following DATA step creates a variable named SOURCE that contains the name of the source data set.

/* create SOURCE variable: Using the IN= option is long and tedious */
data Combined;
set A0(in = ds0)  
    A1(in = ds1)  A2(in = ds2)  A3(in = ds3)
    A4(in = ds4)  A5(in = ds5)  A6(in = ds6)
    A7(in = ds7)  A8(in = ds8)  A9(in = ds9);
if ds0 then source = "A0";
else if ds1 then source = "A1";
else if ds2 then source = "A2";
else if ds3 then source = "A3";
else if ds4 then source = "A4";
else if ds5 then source = "A5";
else if ds6 then source = "A6";
else if ds7 then source = "A7";
else if ds8 then source = "A8";
else if ds9 then source = "A9";
run;
 
proc print data=Combined; run;

An easier way: Use the INDSNAME= option

The IN= data set option gets the job done, but for this task it is messy and unwieldy. Fortunately, there is an easier way. You can use the INDSNAME= option on the SET statement to create a temporary variable that contains the name of the source data set. You can then use the SCAN function to extract the libref and data set name to permanent variables, as follows:

/* create SOURCE variable: Using the INDSNAME= option is quick and easy */
data Combined2;
set A0-A9 indsname = source;  /* the INDSNAME= option is on the SET statement */
libref = scan(source,1,'.');  /* extract the libref */
dsname = scan(source,2,'.');  /* extract the data set name */
run;
 
proc print data=Combined2; run;

Success! Now the source for each observation is clearly indicated.

For an additional example, see the SAS Sample "Obtain the name of data set being read with SET statement with INDSNAME= option." For more cool DATA step features, see the paper "New in SAS 9.2: It’s the Little Things That Count" (Olson, 2008).
