Statistical model building and the SELECT procedures in SAS

Last week I read an interesting paper by Bob Rodriguez: "Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT Software." In it, Rodriguez summarizes five modern techniques for building predictive models and highlights recent SAS/STAT procedures that implement those techniques. The paper discusses the following high-performance (HP) procedures in SAS/STAT software:

  • The GLMSELECT procedure builds parsimonious linear regression models. This procedure supports modern effect selection techniques such as LASSO, LAR, and the elastic net. (Technically, the GLMSELECT procedure pre-dates the HP procedures. However, it is multithreaded. The HPREG procedure provides similar functionality in a distributed environment.)
  • The HPGENSELECT procedure builds generalized linear models such as Poisson models, zero-inflated models, Tweedie models, and more. The procedure supports the selection of effects.
  • The QUANTSELECT procedure builds quantile regression models and features effect selection.
  • The GAMPL procedure fits generalized additive models. I have previously shown the power of the GAMPL procedure to fit binary response data.
  • The HPSPLIT procedure builds classification and regression trees. Tree-based procedures are a popular choice when you want a nonparametric model that is interpretable in terms of business rules.

If you are unfamiliar with these newer procedures, I encourage you to read the Rodriguez paper. Although they are designed for distributed computations, you can also run these procedures in single-machine mode.

Thoughts on the "SELECT" procedures

The GLMSELECT, HPGENSELECT, and (HP)QUANTSELECT procedures support choosing a small number of effects from among a large list of possible effects. Some statisticians call these "variable selection" procedures, but "effect selection" is a more appropriate term because an important use case is selecting the important interaction effects from the list of all second-order interactions. Regardless of what you call them, the procedures automatically select a small number of effects that provide a good predictive model from among hundreds or thousands of candidate effects. You can either accept the selected model or use traditional modeling techniques to refine the model from among the selected candidate effects.
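
As a concrete example, here is a minimal sketch (my own illustration, not from the Rodriguez paper) of effect selection with PROC GLMSELECT. It builds a model from among all main effects and two-way interactions of several continuous variables in the Sashelp.Baseball data; the LASSO method and the SBC choice criterion are arbitrary choices for this sketch:

proc glmselect data=sashelp.baseball;
   /* bar notation with @2 generates all main effects and two-way interactions */
   model logSalary = nHits|nRuns|nRBI|nBB|YrMajor|CrHits @2
         / selection=lasso(choose=SBC);   /* LASSO selection; choose model by SBC */
run;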

While thinking about the use of the word "SELECT" in the names of these procedures, it occurred to me that there is another SAS/STAT procedure that contains the word SELECT, and that is the SURVEYSELECT procedure. However, the SURVEYSELECT procedure is different in that it randomly selects observations (rows in the design matrix) whereas the previous procedures select variables (columns in the design matrix).

This is not the only example of this observation-variable dichotomy in SAS/STAT. The cluster procedures all have "CLUS" as part of their names. The ACECLUS, CLUSTER, FASTCLUS, and MODECLUS procedures all attempt to group observations into clusters. However, the VARCLUS procedure is a dimension reduction technique that groups variables together.


Do you write unnecessary SAS statements?

I'm addicted to you.
You're a hard habit to break.
Such a hard habit to break.
—  Chicago, "Hard Habit To Break"

Habits are hard to break. For more than 20 years I've been putting semicolons at the end of programming statements in SAS, C/C++, and Java/Javascript. But lately I've been working in a computer language that does not require semicolons. Nevertheless, my fingers have a mind of their own, and I catch myself typing unnecessary semicolons out of habit.

I started thinking about superfluous statements in the SAS language. Some programmers might argue that if the program still runs correctly, then unnecessary statements are inconsequential. However, as a general rule I think it is a good programming practice to avoid writing unnecessary statements.

Here are a few examples of unnecessary SAS statements. Can you think of more?

A RUN statement after a DATALINES statement

The doc for the DATALINES statement in the SAS DATA step states: "The DATALINES statement is the last statement in the DATA step. Use a null statement (a single semicolon) to indicate the end of the input data." In other words, you do not need a RUN statement after the semicolon to run the DATA step. The following example runs correctly and creates a data set:

data A;
input x @@;
datalines;
1 2 3 4 5 6
;                              /* <== no RUN statement required */

How many times have you seen a RUN statement after a DATALINES statement? Countless! I've even seen examples in the SAS documentation that use this unnecessary statement.

A semicolon after a macro call

If you define a macro that contains a complete set of valid SAS statements, you do not need another semicolon when you call the macro. For example, the following macro call is valid:

%macro TOP(dsname);
   proc print data=&dsname(obs=5); run;
%mend;
 
%TOP(sashelp.class)            /* <== no semicolon required */

It's not a big deal if you type the semicolon because a semicolon is the null statement. It has no performance implications. But for some reason it bothers me when I catch myself doing it.

A RUN statement in a fully interactive procedure

In a fully interactive procedure, statements are executed immediately. The RUN statement has no effect. Examples include PROC IML, PROC SQL, and PROC OPTMODEL. You use the QUIT statement to exit these procedures, which means that the RUN statement is never needed. The following program is correct and runs three statements. In interactive mode, each statement gets run when the SAS parser reaches the semicolon that ends the statement.

proc sql;
create table Example (x num, y num);          /* statement 1 */
insert into Example
   values(1, 2)  values(3, 4)  values(5, 6);  /* statement 2 */
select *
   from Example;                              /* statement 3 */
/* no RUN statement required! */

A RUN statement in (some) procedures that support RUN-group processing

Some SAS procedures are partly interactive. Procedures such as PROC DATASETS, PROC REG, and PROC GLM support RUN-group processing. For these procedures, the RUN statement defines blocks of statements that get executed, but the procedure remains running until it encounters a QUIT statement.

Many SAS/STAT procedures interpret QUIT to mean "run the most recent statements and then quit." For these procedures, you do not need a RUN statement before you call QUIT. The following statements run a regression analysis and then quit the procedure:

proc glm data=sashelp.class;
model weight = height;
quit;    /* <== No RUN statement; Runs previous statements, then quits */

Unfortunately, SAS procedures are not completely consistent in implementing the QUIT statement. In some SAS procedures the QUIT statement means "ignore the most recent statements and quit." The canonical examples are the traditional SAS/GRAPH procedures such as PROC GPLOT. In the following program, the first PLOT statement creates a scatter plot because it is followed by a RUN statement. However, the second PLOT statement is not followed by a RUN statement, so it is ignored and the second plot is not produced.

proc gplot data=sashelp.class;
   plot weight*height;
run;     /* <== executes previous PLOT statement; does not quit */
   plot weight*age;
quit;    /* <== ignores previous PLOT statement, then quits */
/* use RUN; QUIT; to produce the second plot */

If you aren't sure how a procedure behaves with regard to RUN-group processing, it is always safe to use the RUN and QUIT statements in tandem.

When to include optional statements?

The previous sections describe unnecessary statements that I like to skip. However, sometimes I include optional statements in my programs for clarity, readability, or to practice defensive programming.

SAS supports many optional statements. When you omit an optional statement, the procedure uses a default behavior. For example, if you omit the VAR statement, most procedures run on all numerical variables (for example, PROC MEANS) or on all variables (for example, PROC PRINT). When I want the default behavior, I skip the VAR statement.
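
For example, the following two calls to PROC MEANS produce the same analysis of the Sashelp.Class data. The first call relies on the default behavior; the second lists the analysis variables explicitly:

proc means data=sashelp.class;   /* no VAR statement: analyze all numeric variables */
run;
 
proc means data=sashelp.class;
   var Age Height Weight;        /* explicit VAR statement; same result */
run;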

Another statement that is technically optional is the RUN statement for a sequence of procedures. Because the next call to a procedure or DATA step will always end the previous procedure, you can technically omit the RUN statement for all but the last procedure. This means that the following program is valid, although I do not recommend this style of programming:

data class;
   set sashelp.class(where=(sex='M'));  /* 'class' is the _LAST_ data set */
proc means;                             /* DATA= _LAST_ */
proc print;                             /* DATA= _LAST_ */
run;

If I'm feeling lazy, I might write these statements during the early exploratory phase of a data analysis. However, for serious work I terminate every procedure by using a RUN or QUIT statement. Skipping a RUN statement can lead to undesirable interactions with global statements such as the TITLE statement and ODS statements.
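
Here is a small sketch of the kind of surprise I mean. Because the PROC PRINT step does not end with a RUN statement, the TITLE statement is encountered before the step reaches a boundary, so the new title is applied to the PROC PRINT output even though it was intended for the PROC MEANS output:

proc print data=sashelp.class;   /* no RUN statement, so this step has not executed yet */
title "Summary Statistics";      /* intended for PROC MEANS ...                         */
proc means data=sashelp.class;   /* ... but PROC PRINT runs here and gets the new title */
run;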

Your thoughts?

There is much more that can be said about these topics. What are your thoughts?

  • Are there unnecessary statements that you write out of habit?
  • Are there optional statements that you always include because it makes the program clearer?

Color markers in a scatter plot by a third variable in SAS

One of my favorite new features in PROC SGPLOT in SAS 9.4m2 is the addition of the COLORRESPONSE= and COLORMODEL= options to the SCATTER statement. By using these options, it is easy to color markers in a scatter plot so that the colors indicate the values of a continuous third variable.

If you are running an earlier release of SAS, you can use the Graph Template Language (GTL) to create a scatter plot where the markers are colored by a continuous response variable.

If you have a discrete variable (like gender), you can use the GROUP= option to color markers.
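
For example, the following statements color the markers by the levels of the discrete Sex variable in the Sashelp.Class data:

proc sgplot data=sashelp.class;
   scatter x=height y=weight / group=sex;   /* discrete variable: one color per level */
run;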

Color markers by a continuous response variable

You can use the COLORRESPONSE= option to visualize the values of a third variable by using colored markers in a scatter plot. For example, the following statements create a scatter plot of weight versus height for 19 students. Each marker is assigned a color that reflects the age of the student.

title "Markers Colored by Age";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        datalabel=Name;                            /* add labels to markers */
run;
Markers colored by third variable

By default, the plot uses a three-color gradient ramp. The smallest value of age (Joyce, age 11) is colored blue. The largest value of age (Philip, age 16) is colored red. Markers that correspond to ages near the midrange (in this case, 13.5) are colored black. The gradient color ramp is shown on the right side of the plot. The plot shows that shorter and lighter students tend to be younger than taller and heavier students.

Use a different color ramp

The default color ramp is called ThreeColorAltRamp and is shown above. (The "Alt" version is suitable for use on a light-colored background.) You can change the color ramp by using the COLORMODEL= option in the SCATTER statement. The ODS graphics system provides several other two- and three-color color ramps. One is called the TwoColorRamp color ramp, which uses white as a color for low values and blue as a color for high values. Because the background of the graph is white, some markers will be hard to see unless you take additional actions. The following statements change the background color to a light gray and put a black outline around each marker so that an almost-white marker is still visible against the background:

title2 "Colormodel = TwoColorRamp";
proc sgplot data=sashelp.class;
styleattrs wallcolor=CXF0F0F0;                     /* light gray background */
scatter x=height y=weight / colorresponse=age 
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        filledoutlinedmarkers                      /* add outlines */
        colormodel=TwoColorRamp;                   /* white-blue ramp */
run;
Color markers by two-color gradient ramp

Define a custom color ramp

You can also use the COLORMODEL= option to create a custom color ramp. There are multiple ways to specify colors in SAS, including English color names and hexadecimal values that begin with the prefix 'CX'. For example, you could create a four-color ramp by using the following COLORMODEL= option:

   colormodel=(blue green orange red);       /* custom 4-color ramp */

Be aware that not all color ramps are created equal. If your color ramp contains colors that vary a lot in brightness and intensity, the viewer's eye will be drawn toward certain markers and away from others. This can bias the way that the viewer perceives the data.

I wrote a previous article that discussed making wise choices when you create a custom color ramp. In the following statements, I use colors from a six-color "spectral" color ramp in which all colors are perceptually equivalent. (I use the PALETTE function in SAS/IML to generate perceptually balanced color ramps.)

title2 "Custom Colormodel";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
   markerattrs=(symbol=CircleFilled size=14)
   filledoutlinedmarkers 
   colormodel=(CX3288BD CX99D594 CXE6F598 CXFEE08B CXFC8D59 CXD53E4F);
xaxis grid; yaxis grid;
run;
Markers colored by custom gradient color ramp

Change the location of the gradient legend

By default the gradient legend appears on the right side of the plot. You can change the location and a few other attributes of the gradient legend by using the GRADLEGEND statement in SAS 9.4m2. For example, if you want the gradient legend to appear at the bottom of the plot and want to modify the title for the legend, you add the following SGPLOT statement:

gradlegend / position=bottom title="Age (yrs)";
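
In context, a sketch of the full call might look like the following. I assign a name to the SCATTER plot and reference that name in the GRADLEGEND statement; the name "scat" is just an arbitrary label for this sketch:

proc sgplot data=sashelp.class;
   scatter x=height y=weight / colorresponse=age name="scat"
           markerattrs=(symbol=CircleFilled size=14);
   gradlegend "scat" / position=bottom title="Age (yrs)";
run;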

In SAS 9.4m3, the COLORRESPONSE= and COLORMODEL= options are available not only in the SCATTER statement but also in other statements, including the DOT, HBAR, HIGHLOW, SERIES, VBAR, VECTOR, and WATERFALL statements.


Absorbing Markov chains in SAS

Last week I showed how to represent a Markov transition matrix in the SAS/IML matrix language. I also showed how to use matrix multiplication to iterate a state vector, thereby producing a discrete-time forecast of the state of the Markov chain system. This article shows that the expected behavior of a Markov chain can often be determined just by performing linear algebraic operations on the transition matrix.

Absorbing Markov chains

Whereas the system in my previous article had four states, this article uses an example that has five states. The ith row of the following transition matrix gives the probabilities of transitioning from State i to any other state:

proc iml;
/* i_th row contains transition probabilities from State i */
P = { 0    0.5  0    0.5  0,
      0.5  0    0.5  0    0,
      0    0.5  0    0    0.5,
      0    0    0    1    0,
      0    0    0    0    1 };

For example, the last row of the matrix indicates that if the system is in State 5, the probability is 1 that it stays in State 5. This is the definition of an absorbing state. An absorbing state is common for many Markov chains in the life sciences. For example, if you are modeling how a population of cancer patients might respond to a treatment, possible states include remission, progression, or death. Death is an absorbing state because dead patients have probability 1 that they remain dead. The non-absorbing states are called transient. The current example has three transient states (1, 2, and 3) and two absorbing states (4 and 5).

If a Markov chain has at least one absorbing state and every state has a nonzero probability of eventually reaching an absorbing state, then the chain is called an absorbing Markov chain. The Markov chain determined by the P matrix is absorbing. For an absorbing Markov chain, you can discover many statistical properties of the system just by using linear algebra. The formulas and examples used in this article are taken from the online textbook by Grinstead and Snell.

The first step for analyzing an absorbing chain is to permute the rows and columns of the transition matrix so that all of the transient states are listed first and the absorbing states are listed last. (The P matrix is already in this form.) If there are k absorbing states, the transition matrix in block form looks like the following:

Block form of an absorbing Markov transition matrix

The bottom right block of the transition matrix is a k x k identity matrix and represents the k absorbing states. The top left block contains the probabilities of transitioning between transient states. The upper right block contains the probabilities of transitioning from a transient state to an absorbing state. The lower left block contains zeros because there is no chance of transitioning from an absorbing state to any other state.

The following SAS/IML statements show how to extract the Q and R matrices from the P matrix:

k = sum(vecdiag(P)=1);      /* number of absorbing states */
nt = ncol(P) - k;           /* number of transient states */
 
Q = P[1:nt, 1:nt];          /* prob of transitions between transient states */ 
R = P[1:nt, nt+1:ncol(P)];  /* prob of transitions to absorbing state */

Expected behavior of absorbing Markov chains

For an absorbing Markov chain, the system will eventually end up in one of the absorbing states, regardless of which transient state it starts in. The following questions are of interest. If the system starts in the transient state i, then:

  1. What is the expected number of steps the system spends in transient state j?
  2. What is the expected number of steps before entering an absorbing state?
  3. What is the probability that the system will be absorbed into the jth absorbing state?

The answers to these questions are obtained by defining the so-called fundamental matrix, N = (I - Q)^(-1). The fundamental matrix answers the first question because the (i,j)th entry of N is the expected number of steps that the system spends in transient state j if it starts in transient state i:

transState = char(1:nt);        /* labels of transient states */
absState = char(nt+1:ncol(P));  /* labels of absorbing states */
 
/* Fundamental matrix gives the expected time that the system is 
   in transient state j if it starts in transient state i */
N = inv(I(nt) - Q);  
print N[L="Expected Time in State j" r=transState c=transState];
Expected time in State j for an absorbing Markov chain

The first row indicates that if the system starts in State 1, then on average it will spend 1.5 units of time in State 1 (including the initial time period), 1 unit of time in State 2, and 0.5 units of time in State 3 before transitioning to an absorbing state. Similarly, if the system starts in State 2, you can expect 1 unit, 2 units, and 1 unit of time in States 1, 2, and 3, respectively, before transitioning to an absorbing state.

Obviously, if you sum the values in each row, you get the expected number of steps until the system transitions to an absorbing state:

/* Expected time before entering an absorbing state if the system
   starts in the transient state i */
t = N[,+];
print t[L="Expected Time until Absorption" r=transState];
Expected time until absorption for an absorbing Markov chain

The remaining question is, to me, the most interesting. If the system starts in a transient state, you know that it will eventually transition into one of the k absorbing states. But which one? What is the probability that the system ends in the jth absorbing state if it starts in the ith transient state? The answer is obtained by the matrix multiplication N*R because the matrix N is the expected number of steps in each transient state and R contains the probability of transitioning from a transient state to an absorbing state. For our example, the computation is as follows:

/* The probability that an absorbing chain will be absorbed
   into j_th absorbing state if it starts in i_th transient state */
B = N*R;
print B[L="Absorption Probabilities" r=transState c=absState];
Absorption probabilities for an absorbing Markov chain with two absorbing states

The first row of the matrix indicates that if the system starts in State 1, it will end up in State 4 three quarters of the time and in State 5 one quarter of the time. The second row indicates that if the system starts in State 2, there is a 50-50 chance that it will end up in State 4 or State 5.

Because this Markov chain is a stochastic process, you cannot say with certainty whether the system will eventually be absorbed into State 4 or State 5. However, starting the system in State 1 means that there is a higher probability that the final state will be State 4. Similarly, starting in State 3 means a higher probability that the final state will be in State 5.

There are many other statistics that you can compute for absorbing Markov chains. Refer to the references for additional computations.

References


Break a sentence into words in SAS

Two of my favorite string-manipulation functions in the SAS DATA step are the COUNTW function and the SCAN function. The COUNTW function counts the number of words in a long string of text. Here "word" means a substring that is delimited by special characters, such as a space character, a period, or a comma. The SCAN function enables you to parse a long string and extract words. You can specify the delimiters yourself or use the default delimiters. Ron Cody discusses these and other string manipulation functions in his excellent 2005 tutorial, "An Introduction to SAS Character Functions."

Using the COUNTW and SCAN functions in the DATA step

For example, the following DATA step reads in a long line of text. The COUNTW function counts how many words are in the string. A loop then iterates over the number of words and the SCAN function extracts each word into a variable:

data parse;
length word $20;                 /* number of characters in the longest word */
input str $ 1-80;
delims = ' ,.!';                 /* delimiters: space, comma, period, ... */
numWords = countw(str, delims);  /* for each line of text, how many words? */
do i = 1 to numWords;            /* split text into words */
   word = scan(str, i, delims);
   output;
end;
drop str delims i;
datalines;
Introduction,to SAS/IML   programming!
Do you have... a question?
;
 
proc print data=parse;
run;

Notice that the delimiters do not include the '/' or '?' characters. Therefore these characters are considered to be part of words. For example, the strings "SAS/IML" and "question?" include those non-letter characters. Notice also that consecutive delimiters are treated as a single delimiter, so extra spaces and the ellipsis marks do not produce empty words.
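
If you want the slash and question mark to act as word separators, simply add them to the delimiter list. With the modified list below, "SAS/IML" splits into two words and the trailing '?' is dropped:

   delims = ' ,.!/?';               /* also treat '/' and '?' as delimiters */
   numWords = countw(str, delims);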

Creating a vector of words in SAS/IML

One of the advantages of the SAS/IML matrix language is that you can call the hundreds of functions in Base SAS. When you pass in a vector of arguments to a Base SAS function, the function returns a vector that is the same size and shape as the parameter. In this way, you can vectorize the calling of Base SAS functions. In particular, you can pass in a vector of indices to the SCAN function and get back a vector of words. You do not need to write a loop to extract multiple words, as the following example demonstrates:

proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!'; 
n = countw(s, delims);  
words = scan(s, 1:n, delims);  /* pass parameter vector: create vector of words */
print words;

In summary, Base SAS provides many useful functions such as the string manipulation functions. This article shows that when you call these functions from SAS/IML and pass in a parameter vector, you get back a vector of results.


Markov transition matrices in SAS/IML

Many computations in elementary probability assume that the probability of an event is independent of previous trials. For example, if you toss a coin twice, the probability of observing "heads" on the second toss does not depend on the result of the first toss.

However, there are situations in which the current state of the system determines the probability of the next state. A stochastic process in which the probabilities depend on the current state is called a Markov chain.

A Markov transition matrix models the way that the system transitions between states. A transition matrix is a square matrix in which the (i,j)th element is the probability of transitioning from state i into state j. The sum of each row is 1. For reference, Markov chains and transition matrices are discussed in Chapter 11 of Grinstead and Snell's Introduction to Probability.

A transition matrix of probabilities

A Wikipedia article on Markov chains uses a sequence of coin flips to illustrate transitioning between states. This blog post uses the same example, which is described below.

Imagine a game in which you toss a fair coin until the sequence heads-tails-heads (HTH) appears. The process has the following four states:

  • State 1: No elements of the sequence are in order. If the next toss is tails (T), the system stays at State 1. If the next toss is heads (H), the system transitions to State 2.
  • State 2: H. The first element of the sequence is in order. If the next toss is H, the system stays at State 2. If the next toss is T, the system transitions to State 3.
  • State 3: HT. The first two elements of the sequence are in order. If the next toss is H, transition to State 4. If the next toss is T, transition to State 1.
  • State 4: HTH. The game is over. Stay in this state.

The transition matrix is given by the following SAS/IML matrix. The first row contains the transition probabilities for State 1, the second row contains the probabilities for State 2, and so on.

proc iml;
/* Transition matrix. Columns are next state; rows are current state */
/*     Null  H   HT  HTH */
P =    {0.5  0.5 0   0,   /* Null */
        0    0.5 0.5 0,   /* H    */
        0.5  0   0   0.5, /* HT   */
        0    0   0   1};  /* HTH  */
states = "1":"4";
print P[r=states c=states L="Transition Matrix"];
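
As a quick sanity check (my addition, not in the original program), you can verify that each row of P sums to 1 by using the subscript reduction operator:

rowSums = P[ ,+];       /* sum across each row; every sum should equal 1 */
print rowSums[r=states];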

Analyzing sequences of coin tosses can be interesting and sometimes counterintuitive. Let's describe the expected behavior of this system.

The state vector

You can use a four-element row vector to represent the probabilities that the system is in each state. If the system is in the ith state, put a 1 in the ith element and zero elsewhere. Thus the state vector {1 0 0 0} indicates that the system is in State 1.

You can also use the state vector to represent probabilities. If the system has a 50% chance of being in State 1 and a 50% chance of being in State 2, the state of the system is represented by the state vector {0.5 0.5 0 0}.

The time-evolution of the system can be studied by multiplying the state vector and the transition matrix. If s is the current state of the system, then s*P gives the vector of probabilities for the next state of the system. Similarly, (s*P)*P = s*P^2 describes the probabilities of the system being in each state after two tosses.

For the HTH-sequence game, suppose that you start a new game. The game starts in State 1. The following computation describes the evolution of the state vector:

s0 = {1 0 0 0};    /* a new game is in State 1 */
s1 = s0 * P;       /* probability distribution of states after 1 toss */
s2 = s1 * P;       /* probability distribution of states after 2 tosses */
print (s1//s2)[L="Prob of State" c=("1":"4") r={"1 toss" "2 tosses"}];
First two states in a Markov chain

The first row of the output gives the state probabilities for the system after one coin toss. The system will be in State 1 with probability 0.5 (if you toss T) and will be in State 2 with probability 0.5 (if you toss H). The second row is more interesting. The computation uses the probabilities from the first toss to compute probabilities for the second toss. After two tosses, the probability is 0.25 for being in State 1 (TT), the probability is 0.5 for being in State 2 (TH or HH), and the probability is 0.25 for being in State 3 (HT).

Modeling a sequence of coin tosses

In general you can repeatedly multiply the state vector by the transition matrix to find the state probabilities after k time periods. For efficiency you should avoid concatenating results inside a loop. Instead, allocate a matrix and use the ith row to hold the ith state vector, as follows:

/* Iterate to see how the probability distribution evolves */
numTosses = 10;
s0 = {1 0 0 0};                     /* initial state */
s = j(numTosses+1, ncol(P), .);     /* allocate room for all tosses */
s[1,] = s0;                         /* store initial state */
do i = 2 to nrow(s);
   s[i,] = s[i-1,] * P;             /* i_th row = state after (i-1) iterations */
end;
iteration = 0:numTosses;            /* iteration numbers */
print s[L="Prob of State" c=("1":"4") r=(char(iteration))];
First 10 states in a Markov chain with one absorbing state

The output shows the state probabilities for a sequence of 10 coin tosses. Recall that the last row of the transition matrix ensures that the sequence stays in State 4 after it reaches State 4. Therefore the probability of the system being in State 4 is nondecreasing. The output shows that there is a 65.6% chance that the sequence HTH will appear in 10 tosses or less.

You can visualize the evolution of the probability distributions by making a series plot for each column of this output. You can download the SAS program that creates the plot and contains all of the computations in this article. The plot is shown below:

Predicted states for a Markov chain by iterating a transition matrix

The plot shows the probability distributions after each toss when the system starts in State 1. After three time periods the system can be in any state. Over the long term, the system has a high probability of being in State 4 and a negligible chance of being in the other three states.
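
The downloadable program is not reproduced here, but the following sketch shows one way to create such a plot. It assumes that the matrix s from the previous step is still in memory; the variable names State1-State4 are my own labels. The sketch writes the state probabilities to a data set and overlays one SERIES plot per column:

varNames = "State1":"State4";
create MC from s[colname=varNames];   /* write the state probabilities to a data set */
append from s;
close MC;
quit;
 
data MC;
   set MC;
   Toss = _N_ - 1;                    /* iteration number: 0, 1, ..., numTosses */
run;
 
proc sgplot data=MC;
   series x=Toss y=State1;
   series x=Toss y=State2;
   series x=Toss y=State3;
   series x=Toss y=State4;
   yaxis label="Probability of State";
run;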

Modeling transitions in a population

You can also apply the transition matrix to a population of games. For example, suppose that many students are playing the game. At a certain instant, 50% of the games are in State 1, 30% are in State 2, and 20% are in State 3. You can represent that population of initial states by using the initial vector

s0 = {0.5 0.3 0.2 0};

The following graph gives the probability of the states for the population for the next 30 coin tosses:

Predicted states for a Markov chain by iterating a transition matrix

The initial portion of the graph looks different from the previous graph because the population starts in a different initial distribution of states. However, the long-term behavior of this system is the same: all games eventually end in State 4. For this initial population, the graph shows that you should expect 80% of the games to be in State 4 by the 13th toss.
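
If you want to verify the 80% figure yourself, a short sketch like the following iterates the population state vector (the transition matrix is repeated so that the snippet is self-contained) and finds the first toss at which the probability of State 4 reaches 0.8:

proc iml;
P = {0.5 0.5 0   0,
     0   0.5 0.5 0,
     0.5 0   0   0.5,
     0   0   0   1  };
s0 = {0.5 0.3 0.2 0};               /* 50% State 1, 30% State 2, 20% State 3 */
numTosses = 30;
s = j(numTosses+1, ncol(P), .);     /* allocate room for all tosses */
s[1,] = s0;
do i = 2 to nrow(s);
   s[i,] = s[i-1,] * P;
end;
idx = loc(s[,4] >= 0.8);            /* rows where P(State 4) is at least 0.8 */
firstToss = idx[1] - 1;             /* subtract 1 because row 1 corresponds to toss 0 */
print firstToss;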

A real example: Predicting caseloads for social workers and parole officers

An interesting application of using Markov chains was presented by Gongwei Chen at SAS Global Forum 2016. Chen built a discrete-time Markov chain model to forecast workloads at the Washington State Department of Corrections. Social workers and parole officers supervise convicted offenders who have been released from prison or who were sentenced to supervision by the court system. Individuals who exhibit good behavior can transition from a highly supervised situation into less supervision. Other individuals might commit a new offense that requires an increase in supervision. Chen used historical data to estimate the transition matrix between the different supervisory states. His model helps the State of Washington forecast the resources needed to supervise offenders.

Summary

In summary, it is easy to represent a transition matrix and a state vector in SAS/IML. You can iterate the initial distribution of states to forecast the state of the system after an arbitrary number of time periods. This is done by using matrix multiplication.

A Markov chain model involves iterating a linear dynamical system. The qualitative asymptotic behavior of such systems can be described by using the tools of linear algebra. In a future article, I will describe how you can compute statistical properties of Markov chain models from the transition matrix.


Cantor sets, the devil's staircase, and probability

Last week I blogged about how to draw the Cantor function in SAS. The Cantor function is used in mathematics as a pathological example of a function that is constant almost everywhere yet somehow manages to "climb upwards," thus earning the nickname "the devil's staircase."

The Cantor function has three properties: (1) it is continuous, (2) it is non-decreasing, and (3) F(0)=0 and F(1)=1. There is a theorem that says that any function with these properties is the cumulative distribution function (CDF) for some random variable. In theory, therefore, you can generate a large number of random variates (a large sample), then use PROC UNIVARIATE to plot the empirical CDF. The empirical CDF should be the devil's staircase function.

A random sample from the Cantor set

My last article showed that every point in the Cantor set can be written as the infinite sum x = Σ_i a_i 3^(-i), where the a_i equal 0 or 2. You can approximate the Cantor set by truncating the series at a certain number of terms. You can generate random points from the (approximate) Cantor set by choosing the coefficients randomly from the Bernoulli distribution with probability 0.5. The following DATA step generates 10,000 points from the Cantor set by using random values for the first 20 coefficients:

data Cantor;
call streaminit(123456);
k = 20;             /* maximum number of coefficients */
n = 10000;          /* sample size to generate */
do i = 1 to n;      /* generate n random points in Cantor set */
   x = 0;
   do j = 1 to k;   /* each point is sum of powers of 1/3 */
      b = 2 * rand("Bernoulli", 0.5);  /* random value 0 or 2 */
      x + b * 3**(-j);                 /* power of 1/3 */
   end;
   output;
end;
keep x;
run;
 
proc univariate data=Cantor;
   cdf x;
run;
CantorCDF

The call to PROC UNIVARIATE displays the empirical CDF for the random sample of points. And voilà! The Cantor distribution function appears! As a bonus, the output from PROC UNIVARIATE (not shown) indicates that the mean of the distribution is 1/2 (which is obvious by symmetry) and the variance is 1/8, which you might not have expected.

This short SAS program does two things: it shows how to simulate a random sample uniformly from the (approximate) Cantor set, and it indicates that the devil's staircase function is the distribution function for this uniform random variable.

A random sample in SAS/IML

Alternatively, you can generate the same random sample by using SAS/IML and matrix computations. My previous article drew the Cantor function by systematically using all combinations of 0s and 2s to construct the elements of the Cantor set. However, you can use the same matrix-based approach but generate random coefficients, thereby obtaining a random sample from the Cantor set:

proc iml;
k = 20;          /* maximum number of coefficients */
n = 10##4;       /* sample size to generate */
B = 2 * randfun(n||k, "Bernoulli", 0.5);  /* random n x k matrix of 0/2 */
v = 3##(-(1:k)); /* powers of 1/3 */
x = B * v`;      /* sum of random terms that are either 0*power or 2*power */

Recall that the ECDF is a step function that plots the ith ordered datum at the height i/n. You can approximate the empirical CDF in SAS/IML by using the SERIES subroutine. Technically, the line segments that appear in the line plot have a nonzero slope, but the approximation looks very similar to the PROC UNIVARIATE output:

call sort(x, 1);             /* order elements in the Cantor set */
x = 0 // x // 1;             /* append end points */
y = do(0, 1, 1/(nrow(x)-1)); /* empirical CDF */
title "Empirical Distribution of a Uniform Sample from the Cantor Set";
call series(x, y);

Interpretation in probability

It is interesting that you can actually define the Cantor set in terms of rolling a fair six-sided die. Suppose that you roll a die infinitely often and adopt the following rules:

  • If the die shows a 1 or 2, you get zero points.
  • If the die shows a 3 or 4, you get one point.
  • If the die shows a 5 or 6, you get two points.

This game is closely related to the Cantor set. Recall that the Cantor set can be written as the set of all base-3 decimal values in [0,1] for which the decimal expansion does not contain a 1. In this game, each point in the Cantor set corresponds to a sequence of rolls that never contain a 3 or 4. (Equivalently, the score is always even.) Obviously this would never happen in real life, which is an intuitive way of saying that the Cantor set has "measure zero."
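
To make the "measure zero" intuition concrete, here is a small simulation sketch (my illustration, not from the original post). It plays the die game for 20 rolls per game and estimates the probability that a game never scores a 1, which means that the corresponding point stays in the 20th approximation of the Cantor set. The true probability is (2/3)^20, which is about 0.0003:

data DieGame;
call streaminit(4321);
k = 20;                                /* rolls per game */
do rep = 1 to 100000;                  /* simulate many games */
   inCantor = 1;
   do j = 1 to k;
      roll = ceil(6*rand("Uniform"));  /* fair six-sided die: 1, 2, ..., 6 */
      score = floor((roll-1)/2);       /* 1-2 -> 0 points; 3-4 -> 1; 5-6 -> 2 */
      if score = 1 then inCantor = 0;  /* a score of 1 leaves the Cantor set */
   end;
   output;
end;
keep rep inCantor;
run;
 
proc means data=DieGame mean;
   var inCantor;                       /* estimate of (2/3)**20 */
run;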

References

The first time I saw the Cantor function depicted as an empirical distribution function was when I saw a very compact MATLAB formula like this:

x=2*3.^-[1:k]*(rand(k,n)<1/2);
stairs([min(x) sort(x)],0:1/length(x):1) % Plot the c.d.f of x

The previous formula is equivalent to my SAS/IML program, but in my program I broke the formula apart so that the components could be understood more easily. This formula appears in a set of probability demos by Peter Doyle. I had the privilege to interact with Professor Doyle at The Geometry Center (U. MN) in the early 1990s, so perhaps he was responsible for showing me the Cantor distribution.

These UNC course notes from Jan Hannig also discuss the Cantor distribution, but the original source is not cited.


Visualize the Cantor function in SAS

CantorFunction

I was a freshman in college the first time I saw the Cantor middle-thirds set and the related Cantor "Devil's staircase" function. (Shown at left.) These constructions expanded my mind and led me to study fractals, real analysis, topology, and other mathematical areas.

The Cantor function and the Cantor middle-thirds set are often used as counter-examples to mathematical conjectures. The Cantor set is defined by a recursive process that requires infinitely many steps. However, you can approximate these pathological objects in a matrix-vector language such as SAS/IML with only a few lines of code!

Construction of the Cantor middle-thirds set

The Cantor middle-thirds set is defined by the following iterative algorithm. The algorithm starts with the closed interval [0,1], then does the following:

  1. In the first step, remove the open middle-thirds of the interval. The two remaining intervals are [0,1/3] and [2/3, 1], which each have length 1/3.
  2. In the ith step, remove the open middle-thirds of all intervals from the (i-1)th step. You now have twice as many intervals and each interval is 1/3 as long as the intervals in the previous step.
  3. Continue this process forever.
CantorSet

After two steps you have four intervals: [0,1/9], [2/9,1/3], [2/3, 7/9], and [8/9,1]. After three steps you have eight intervals of length 1/27. After k steps you have 2^k intervals of length 3^(-k).

The Cantor set is the set of all points that never get removed during this infinite process. The Cantor set clearly contains infinitely many points because it contains the endpoints of the intervals that are removed: 0, 1, 1/3, 2/3, 1/9, 2/9, 7/9, 8/9, and so forth. Less intuitive is the fact that the cardinality of the Cantor set is uncountably infinite even though it is a set of measure zero.

Construction of the Cantor function

The Cantor function F: [0,1] → [0,1] can be defined iteratively in a way that reflects the construction of the Cantor middle-thirds set. The function is shown at the top of this article.

  • At step 0, define the function on the endpoints of the [0,1] interval by F(0)=0 and F(1)=1.
  • At step 1, define the function on the closed interval [1/3, 2/3] to be 1/2.
  • At step 2, define the function on [1/9, 2/9] to be 1/4. Define the function on [7/9, 8/9] to be 3/4.
  • Continue this construction. At each step, the function is defined on the closure of the middle-thirds intervals so that the function value is halfway between adjacent values. The function looks like a staircase with steps of different heights and widths. The function is constant on every middle-thirds interval, but nevertheless is continuous and monotonically nondecreasing.

Visualizing the Cantor function in SAS

This is a SAS-related blog, so I want to visualize the Cantor function in SAS. The middle-third intervals during the kth step of the construction have length 3^(-k), so you can stop the construction after a small number of iterations and get a decent approximation. I'll use k=8 steps.

Although the Cantor set and function were defined geometrically, they have an equivalent definition in terms of decimal expansion. The Cantor set is the set of decimal values that can be written in base 3 without using the '1' digit. In other words, elements of the Cantor set have the form x = 0.a_1 a_2 a_3 ... (base 3), where each a_i equals 0 or 2.

An equivalent definition in terms of fractions is x = Σ_i a_i 3^(-i), where a_i equals 0 or 2. Although the sum is infinite, you can approximate the Cantor set by truncating the series after finitely many terms. A sum like this can be expressed as an inner product x = a*v` where a is a k-element row vector that contains 0s and 2s and v is a vector that contains the elements {1/3, 1/9, 1/27, ..., 3^(-k)}.

You can define B to be a matrix with k columns and 2^k rows that contains all combinations of 0s and 2s. Then the matrix product B*v` is an approximation to the Cantor set after k steps of the construction. It contains the right-hand endpoints of the middle-third intervals.

In SAS/IML you can use the EXPANDGRID function to create a matrix whose rows contain all combinations of 0s and 2s. The ## operator raises an element to a power. Therefore the following statements construct and visualize the Cantor function. With a little more effort, you can write a few more statements that improve the approximation and add fractional tick marks to the axes, as shown in the graph at the top of this article.

proc iml;
/* rows of B contain all 8-digit combinations of 0s and 2s */ 
B = expandgrid({0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2});
B = B[2:nrow(B),];      /* remove first row of zeros */
k = ncol(B);            /* k = 8 */
 
v = 3##(-(1:k));       /* vector of powers 3^{-i} */
t = B * v`;            /* x values: right-hand endpts of middle-third intervals */
 
u = 2##(-(1:k));       /* vector of powers 2^{-i} */
f = B/2 * u`;          /* y values: Cantor function on Cantor set */
call series(t, f);     /* graph the Cantor function */

I think this is a very cool construction. Although the Cantor function is defined iteratively, there are no loops in this program. The loops are replaced by matrix multiplication and vectors. The power of a matrix language is that it enables you to compute complex quantities with only a few lines of programming.

Do you have a favorite example from math or statistics that changed the way that you look at the world? Leave a comment.

References

This short article cannot discuss all the mathematically interesting features of the Cantor set and Cantor function. The following references are provided for the readers who want additional information:


In praise of simple graphics

'Tis a gift to be simple.
-- Shaker hymn

In June 2015 I published a short article for Significance, a magazine that features statistical and data-related articles that are of general interest to a wide range of scientists.

The title of my article is "In Praise of Simple Graphics." It is based on a blog post "Visualizing the causes of airline crashes."

My article compares infographics and statistical graphics. Infographics are designed to appeal as well as to inform. Unfortunately, a beautiful artistic display can sometimes obscure the data.

In contrast, a statistician usually has a different goal: represent the data objectively and let the data speak for themselves. Standard statistical graphics are purposely free of excess adornment in a Tuftean effort to maximize the data-ink ratio. Their beauty is in their minimalist simplicity.

Yes, I sometimes create complex graphs on my blog. In the past three weeks I've featured spaghetti plots, lasagna plots, and effect plots. However, I create complex graphs only to visualize complex data or models. For simple data, I advocate using a simple graph. I strive to never let the graph get in the way of the data. To paraphrase Einstein, graphs should be as complex as necessary, but no more complex.

You can read my article "In Praise of Simple Graphics" at the Significance web site. If you like data analysis, graphics, and statistical ideas, the Significance magazine archives are a great resource. All issues of Significance are freely available one year after publication. Enjoy!


Use the EFFECTPLOT statement to visualize regression models in SAS

An effect plot (created by using the EFFECTPLOT statement) that visualizes a complex regression model

Graphs enable you to visualize how the predicted values for a regression model depend on the model effects. You can gain an intuitive understanding of a model by using the EFFECTPLOT statement in SAS to create graphs like the one shown at the top of this article.

Many SAS regression procedures automatically create ODS graphics for simple regression models. For more complex models (including interaction effects and link functions), you can use the EFFECTPLOT statement to construct effect plots. An effect plot shows the predicted response as a function of certain covariates while other covariates are held constant.

The EFFECTPLOT statement was introduced in SAS 9.22, but it is not as well known as it should be. Although many procedures include an EFFECTPLOT statement as part of their syntax, I will use the PLM procedure (PLM = post-linear modeling) to show how to construct effect plots. I have previously shown how to use the PLM procedure to score regression models. A good introduction to the PLM procedure is Tobias and Cai (2010), "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models."

The data for this article is the Sashelp.BWeight data set, which is distributed with SAS. There are 50,000 records. Each row gives information about the birth weight of a baby, including information about the mother. This article uses the following variables:

  • MomAge: The mothers were between the ages of 18 and 45. The MomAge variable is centered at the mean age, which is 27. Thus MomAge=-7 means the mother was 20 years old whereas MomAge=5 means that the mother was 32 years old.
  • CigsPerDay: The average number of cigarettes per day that the mother smoked during pregnancy.
  • Boy: An indicator variable. If the baby was a boy, then Boy=1; otherwise Boy=0.

The following DATA step defines a SAS view that contains an indicator variable, Underweight, which has the value 1 if the baby's birth weight was less than 2500 grams and 0 otherwise:

/* Underweight=1 if the birth weight is <2500 grams and Underweight=0 otherwise */
data babyWeight / view=BabyWeight;
   set sashelp.bweight;
   Underweight = (Weight < 2500);
run;

A logistic model with a continuous-continuous interaction

To illustrate the capabilities of the EFFECTPLOT statement, the following statements use PROC LOGISTIC to model the probability of having an underweight boy baby (less than 2500 grams). The explanatory effects are MomAge, CigsPerDay, and the interaction effect between those two variables. The STORE statement creates an item store called logiModel. The item store is read by PROC PLM, which creates the effect plot:

proc logistic data=babyWeight;
   where Boy=1;                  /* restrict to baby boys */
   model Underweight(event='1') = MomAge | CigsPerDay;
   store logiModel;
run;
 
title "Probability of Underweight Boy Baby";
proc plm source=logiModel;
   effectplot fit(x=MomAge plotby=CigsPerDay);
run;
Effect plot (created by using the EFFECTPLOT statement): Predicted probability of underweight boy by mother's age and daily cigarettes

In this example, the output is a panel of plots that show the predicted probability of having an underweight boy baby as a function of the mother's relative age. (Remember: the age is centered at 27 years.) The panel shows slices of the continuous CigsPerDay variable, which enables you to see how the predicted response changes with increasing cigarette use.

The graphs indicate that the probability of an underweight boy is very low in nonsmoking mothers, regardless of the mother's age. In smoking mothers, however, the probability of having an underweight boy increases with age. For mothers of a given age, the probability of an underweight boy increases with the number of cigarettes smoked.

The example shows a panel of fit plots, where the paneling variable is determined by the PLOTBY= option. You can also "stack" the predicted probability curves by using a slice plot. You can specify a slice plot by using the SLICEFIT keyword. You specify the slicing variable by using the SLICEBY= option, as follows:

proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=CigsPerDay);
run;

An example of a slice plot is shown in the next section.

You can also use the EFFECTPLOT statement to create a contour plot of the predicted response as a function of the two continuous covariates, which is also shown in the next section.

A logistic model with categorical-continuous interactions

The effect plot is especially useful when visualizing complex models. When there are several independent variables and interactions, you can create multiple plots that show the predicted response at various levels of categorical or continuous variables. By default, covariates that do not appear in the plots are fixed at their mean level (for continuous variables) or their reference level (for classification variables).

The previous example used a WHERE clause to restrict the data to boy babies. Suppose that you want to include the gender of the baby as a covariate in the regression model. The following call to PROC LOGISTIC includes the main effects and two-way interactions between two continuous and one classification variable. The call to PROC PLM creates a panel of slice plots. Each slice plot shows predicted probability curves for slices of the CigsPerDay variable. The panels are determined by levels of the Boy variable, which is specified on the PLOTBY= option:

proc logistic data=babyWeight;
   class Boy;
   model Underweight(event='1') = MomAge | CigsPerDay | Boy @2;
   store logiModel;
run;
 
proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=CigsPerDay plotby=Boy);
run;

The output is shown in the graph at the top of this article. The right side of the panel shows the predicted probabilities for boys. These curves are similar to those in the previous example, but now they are overlaid on a single plot. The left side of the panel shows the corresponding curves for girl babies. In general, the model predicts that girl babies of smoking mothers have a higher probability of being underweight than boys. The effect is most dramatic for younger mothers.

If you want to add confidence limits for the predicted curves, you can use the CLM option: effectplot slicefit(...) / CLM.

You can specify the levels of a continuous variable that are used to slice or panel the curves. For example, most cigarettes come in a pack of 20, so the following EFFECTPLOT statement visually compares the effect of smoking for pregnant women who smoke zero, one, or two packs per day:

   effectplot slicefit(x=MomAge sliceby=CigsPerDay=0 20 40 plotby=Boy);

Notice that there are no parentheses around the argument to the SLICEBY= option. That is, you might expect the syntax to be sliceby=(CigsPerDay=0 20 40), but that syntax is not supported.

If you want to directly compare the probabilities for boys and girls, you might want to interchange the SLICEBY= and PLOTBY= variables. The following statements create a graph that has three panels, and each panel directly compares boys and girls:

proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=boy plotby=CigsPerDay=0 20 40);
run;

As mentioned previously, you can also create contour plots that display the predicted response as a function of two continuous variables. The following statements create two contour plots, one for boy babies and one for girls:

proc plm restore=logiModel;
   effectplot contour(x=MomAge y=CigsPerDay plotby=Boy);
run;
An effect plot (created by using the EFFECTPLOT statement) that visualizes the response surface for each level of a categorical variable

Summary of the EFFECTPLOT statement

The EFFECTPLOT statement enables you to create plots that visualize interaction effects in complex regression models. The EFFECTPLOT statement is a hidden gem in SAS/STAT software that deserves more recognition. The easiest way to create an effect plot is to use the STORE statement in a regression procedure to create an item store, then use PROC PLM to create effect plots. In that way, you only need to fit a model once, but you can create many plots that help you to understand the model.

You can overlay curves, create panels, and even create contour plots. Several other plot types are also possible. See the documentation for the EFFECTPLOT statement for the full syntax, options, and additional examples of how to create plots that visualize interactions in generalized linear models.
