Female world leaders by year of election

This week Hillary Clinton became the first woman to be nominated for president of the US by a major political party. Although this is a first for the US, many other countries have already passed this milestone. In fact, 60 countries have already elected women as presidents and prime ministers. For example, the UK was previously led by Margaret Thatcher (1979–1990) and is currently led by Theresa May.

The CNN web site published a list of the 60 countries that have been led by a woman president or prime minister. The list is long and requires that you scroll up and down to compare different countries. To help visualize the data, I decided to turn that list into a graphic. You can download the SAS program that contains the data and creates a graph of female world leaders.

I wanted the graph to show the first year that each country elected a female leader. Consequently, countries like Norway and the UK only appear once on the list even though they have been led by women multiple times.

Female world leaders, 1960-2016

I thought about several different graph types, but finally settled on the scatter plot to the left (click to enlarge). The graph shows the cumulative number of countries that have elected a female president or prime minister.

The vertical axis shows the rank of each country on the list, which makes it easy to find which country was first (Sri Lanka in 1960), second (India), third (Israel), and so forth. The tenth country was Norway. The fiftieth country was Denmark.

You can also easily find out which countries elected their first female leader in a particular year, since that information is plotted on the horizontal axis. However, the graph is not so good at finding a particular country unless you know the approximate election year. Can you find Peru?

If you create the HTML version of this graph in SAS, the graph contains tool tips so that you can hover the mouse over a country's name and find out the name of the woman who was elected. The graph on this page shows that Corazon Aquino was the first female president of the Philippines.

The graph shows that the 1990s was a breakout decade in which many countries elected their first female leader. This trend is better shown by using a bar graph that augments the previous information. The bar graph shows that about 15 countries were added to the list during each of the last three decades.

Election of female leaders by decade

Although countries did not elect female leaders until the 1960s, female rulers are not new to the world. Monarchs such as Nefertiti, Queen Elizabeth I, and Catherine the Great are some famous world leaders from previous centuries. Some women were so influential that they have epochs named after them, such as the Elizabethan era and the Victorian era.

There are still three years left in the 2010s, so the current decade is likely to be the decade in which the most countries elected a woman for the first time. Which country will be number 61 on the list? Time will tell.

Post a Comment

How to visualize a kernel density estimate

A kernel density estimate (KDE) is a nonparametric estimate for the density of a data sample. A KDE can help an analyst determine how to model the data: Does the KDE look like a normal curve? Like a mixture of normals? Is there evidence of outliers in the data?

In SAS software, there are two procedures that generate kernel density estimates. PROC UNIVARIATE can create a KDE for univariate data; PROC KDE can create KDEs for univariate and bivariate data and supports several options to choose the kernel bandwidth.

Kernel density estimate (KDE) and component densities

The KDE is a finite mixture distribution. It is a weighted sum of small density "bumps" that are centered at each data point. The shape of the bumps are determined by the choice of a kernel function. The width of the bumps are determined by the bandwidth.

In textbooks and lecture notes about kernel density estimation, you often see a graph similar to the one at the left. The graph shows the kernel density estimate (in blue) for a sample of 10 data values. The data values are shown in the fringe plot along the X axis. The orange curves are the component densities. Each orange curve is a scaled version of a normal density curve centered at a datum.

This article shows how to create this graph in SAS. You can use the same principles to draw the component densities for other finite mixture models.

The kernel bandwidth

The first step is to decide on a bandwidth for the component densities. The following statements define the data and use PROC UNIVARIATE to compute the bandwidth by using the Sheather-Jones plug-in method:

data sample;
input x @@;
46 60 24 15 17 14 21 59 22 16 
proc univariate data=sample;
   histogram x / kernel(c=SJPI);
Histogram and kernel density estimate (KDE)

The procedure create a histogram with a KDE overlay. The legend of the graph gives a standardized kernel bandwidth of c=0.27, but that is not the bandwidth you want. As explained in the documentation, the kernel bandwidth is derived from the normalized bandwidth by the formula λ = c IQR n-1/5, where IQR is the interquartile range and n is the number of nonmissing observations. For this data, IQR = 30 and n=10, so λ ≈ 5. To save you from having to compute these values, the SAS log displays the following NOTE:

NOTE: The normal kernel estimate for c=0.2661 has a bandwidth of 5.0377

The UNIVARIATE procedure can use several different kernel shapes. By default, the procedure uses a normal kernel. The rest of this article assumes a normal kernel function, although generalizing to other kernel shapes is straightforward.

Compute the component densities

Because the kernel function is centered at each datum, one way to visualize the component densities is to evaluate the kernel function on a symmetric interval about x=0 and then translate that component for every data point. In the following SAS/IML program, the vector w contains evenly spaced points in the interval [-3λ, 3λ], where λ is the bandwidth of the kernel function. (This interval contains 99.7% of the area under the normal curve with standard deviation λ.) The vector k is the normal density function evaluated on that interval, scaled by 1/n where n is the number of nonmissing observations. The quantity 1/n is the mixing probability for each component; the KDE is obtained by summing these components.

The program creates an output data set called component that contains three numerical variables. The ID variable is an ID variable that identifies which observation is being used. The z variable is a translated copy of the w variable. The k variable does not change because every component has the same shape and bandwidth.

proc iml;
use sample;  read all var {x};  close;
n = countn(x);
mixProb = 1/n;
lambda = 5;                                 /* bandwidth from PROC UNIVARIATE */
w = do(-3*lambda, 3*lambda, 6*lambda/100);  /* evenly spaced; 3 std dev */
k = mixProb * pdf("Normal", w, 0, lambda);  /* kernel = normal pdf centered at 0 */
ID= .; z = .;
create component var {ID z k};             /* create the variables */
do i = 1 to n;
   ID = j(ncol(w), 1, i);                  /* ID var */
   z = x[i] + w;                           /* center kernel at x[i] */
close component;

Compute the kernel density estimate

The next step is to sum the component densities to create the KDE. The easy way to get the KDE in a data set is to use the OUTKERNEL= option on the HISTOGRAM statement in PROC UNIVARIATE. Alternatively, you can create the full KDE in SAS/IML, as shown below.

The range of the data is [14, 60]. You can extend that range by the half-width of the kernel components, which is 15. Consequently the following statements use the interval [0, 75] as a convenient interval on which to sum the density components. The actual summation is easy in SAS/IML because you can pass a vector of positions to the PDF function i Base SAS.

The sum of the kernel components is written to a data set called KDE.

/* finite mixture density is weighted sum of kernels at x[i] */
a = 0; b = 75;    /* endpoints of interval [a,b] */
t = do(a, b, (b-a)/200);
kde = 0*t;
do i = 1 to n;
   kde = kde + pdf("Normal", t, x[i], lambda) * mixProb;
create KDE var {t kde}; append; close;

Visualize the KDE

You can merge the original data, the individual components, and the KDE curve into a single SAS data set called All. Use the SGPLOT procedures to overlay the three elements. The individual components are plotted by using the SERIES plot with a GROUP= option. A second SERIES plot graphs the KDE curve. A FRINGE statement plots the positions of each datum as a hash mark. The plot is shown at the top of this article.

data All;
merge sample component KDE;
title "Kernel Density Estimate as Weighted Sum of Component Densities";
proc sgplot data=All noautolegend;
   series x=z y=k / group=ID lineattrs=GraphFit2(thickness=1); /* components */
   series x=t y=kde /lineattrs=GraphFit;                       /* KDE curve  */
   fringe x;                                      /* individual observations */
   refline 0 / axis=y;
   xaxis label="x";
   yaxis label="Density" grid;

You can use the same technique to visualize other finite mixture models. However, the FMM procedure creates these plots automatically, so you might never need to create such a plot if you use PROC FMM. The main difference for a general finite mixture model is that the component distributions can be from different families and usually have different parameters. Therefore you will need to maintain a vector of families and parameters. Also, the mixing probabilities usually vary between components.

In summary, the techniques in this article are useful to teachers and presenters who want to visually demonstrate how choosing a kernel shape and bandwidth gives rise to the kernel density estimate.

Post a Comment

Statistical model building and the SELECT procedures in SAS

Last week I read an interesting paper by Bob Rodriguez: "Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT Software." In it, Rodriguez summarizes five modern techniques for building predictive models and highlights recent SAS/STAT procedures that implement those techniques. The paper discusses the following high-performance (HP) procedures in SAS/STAT software:

  • The GLMSELECT procedure builds parsimonious linear regression models. This procedure supports modern effect selection techniques such as LASSO, LAR, and the elastic net. (Technically, the GLMSELECT procedure pre-dates the HP procedures. However, it is multithreaded. The HPREG procedure provides similar functionality in a distributed environment.)
  • The HPGENSELECT procedure builds generalized linear models such as Poisson models, zero-inflated models, Tweedie models, and more. The procedure supports the selection of effects.
  • The QUANTSELECT procedure builds quantile regression models and features effect selection.
  • The GAMPL procedure fits generalized additive models. I have previously shown the power of the GAMPL procedure to fit binary response data.
  • The HPSPLIT procedure builds classification and regression trees. Tree-based procedures are a popular choice when you want a nonparametric model that is interpretable in terms of business rules.

If you are unfamiliar with these newer procedures, I encourage you to read the Rodriguez paper. Although they are designed for distributed computations, you can also run these procedures in single-machine mode.

Thoughts on the "SELECT" procedures

The GLMSELECT, HPGENSELECT, and (HP)QUANTSELECT procedures support choosing a small number of effects from among a large list of possible effects. Some statisticians call these "variable selection" procedures, but "effect selection" is a more appropriate term because an important use case is to use the procedures to select important interaction effects from the list of all second order interactions. Regardless of what you call them, the procedures automatically select a small number of effects that provide a good predictive model from among hundreds or thousands of possible effects. You can either accept the selected model or use traditional modeling techniques to refine the model from among the selected candidate effects.

While thinking about the use of the word "SELECT" in the names of these procedures, it occurred to me that there is another SAS/STAT procedure that contains the word SELECT, and that is the SURVEYSELECT procedure. However, the SURVEYSELECT procedure is different in that it randomly selects observations (rows in the design matrix) whereas the previous procedures select variables (columns in the design matrix).

This is not the only example of this observation-variable dichotomy in SAS/STAT. The cluster procedures all have "CLUS" as part of their names. The ACECLUS, CLUSTER, FASTCLUS, and MODECLUS procedures all attempt to group observations into clusters. However, the VARCLUS procedure is a dimension reduction technique that groups variables together.

Post a Comment

Do you write unnecessary SAS statements?

I'm addicted to you.
You're a hard habit to break.
Such a hard habit to break.
—  Chicago, "Hard Habit To Break"

Habits are hard to break. For more than 20 years I've been putting semicolons at the end of programming statements in SAS, C/C++, and Java/Javascript. But lately I've been working in a computer language that does not require semicolons. Nevertheless, my fingers have a mind of their own, and I catch myself typing unnecessary semicolons out of habit.

I started thinking about superfluous statements in the SAS language. Some programmers might argue that if the program still runs correctly, then unnecessary statements are inconsequential. However, as a general rule I think it is a good programming practice to avoid writing unnecessary statements.

Here are a few example of unnecessary SAS statement. Can you think of more?

A RUN statement after a DATALINES statement

The doc for the DATALINES statement in the SAS DATA step states: "The DATALINES statement is the last statement in the DATA step. Use a null statement (a single semicolon) to indicate the end of the input data." In other words, you do not need a RUN statement after the semicolon to run the DATA step. The following example runs correctly and creates a data set:

data A;
input x @@;
1 2 3 4 5 6
;                              /* <== no RUN statement required */

How many times have you seen a RUN statement after a DATALINES statement? Countless! I've even seen examples in the SAS documentation that use this unnecessary statement.

A semicolon after a macro call

If you define a macro that contains a complete set of valid SAS statements, you do not need another semicolon when you call the macro. For example, the following example is valid:

%macro TOP(dsname);
   proc print data=&dsname(obs=5); run;
%TOP(sashelp.class)            /* <== no semicolon required */

It's not a big deal if you type the semicolon because a semicolon is the null statement. It has no performance implications. But for some reason it bothers me when I catch myself doing it.

A RUN statement in a fully interactive procedure

In a fully interactive procedure, statements are executed immediately. The RUN statement has no effect. Examples include PROC IML, PROC SQL, and PROC OPTMODEL. You use the QUIT statement to exit these procedures, which means that the RUN statement is never needed. The following program is correct and runs three statements. In interactive mode, each statement gets run when the SAS parser reaches the semicolon that ends the statement.

proc sql;
create table Example (x num, y num);          /* statement 1 */
insert into Example
   values(1, 2)  values(3, 4)  values(5, 6);  /* statement 2 */
select *
   from Example;                              /* statement 3 */
/* no RUN statement required! */

A RUN statement in (some) procedures that support RUN-group processing

Some SAS procedures are partly interactive. Procedures such as PROC DATASETS, PROC REG, and PROC GLM support RUN-group processing. For these procedures, the RUN statement defines blocks of statements that get executed, but the procedure remains running until it encounters a QUIT statement.

Many SAS/STAT procedures interpret QUIT to mean "run the most recent statements and then quit." For these procedures, you do not need a RUN statement before you call QUIT. The following statements run a regression analysis and then quit the procedure:

proc glm data=sashelp.class;
model weight = height;
quit;    /* <== No RUN statement; Runs previous statements, then quits */

Unfortunately, SAS procedures are not completely consistent in implementing the QUIT statement. In some SAS procedures the QUIT statement means "ignore the most recent statements and quit." The canonical examples are the traditional SAS/GRAPH procedures such as PROC GPLOT. In the following program, the first PLOT statement creates a scatter plot because it is followed by a RUN statement. However, the second plot statement is not followed by a RUN statement, so it is ignored and the second plot is not produced.

proc gplot data=sashelp.class;
   plot weight*height;
run;     /* <== executes previous PLOT statement; does not quit */
   plot weight*age;
quit;    /* <== ignores previous PLOT statement, then quits */
/* use RUN; QUIT; to produce the second plot */

If you aren't sure how a procedure behaves with regards to RUN-group processing, it is always safe to use the RUN and QUIT statements in tandem.

When to include optional statements?

The previous sections describe unnecessary statements that I like to skip. However, sometimes I include optional statements in my programs for clarity, readability, or to practice defensive programming.

SAS supports many optional statements. When you omit an optional statement, the procedure does some default behavior. For example, if you omit the VAR statement, most procedures runs on all numerical variables (for example, PROC MEANS) or on all variables (for example, PROC PRINT). When I want the default behavior, I skip the VAR statement.

Another statement that is technically optional is the RUN statement for a sequence of procedures. Because the next call to a procedure or DATA step will always end the previous procedure, you can technically omit the RUN statement for all but the last procedure. This means that the following program is valid, although I do not recommend this style of programming:

data class;
   set sashelp.class(where=(sex='M'));  /* 'class' is the _LAST_ data set */
proc means;                             /* DATA= _LAST_ */
proc print;                             /* DATA= _LAST_ */

If I'm feeling lazy, I might write these statement during the early exploratory phase of a data analysis. However for serious work I terminate every procedure by using a RUN or QUIT statement. Skipping a RUN statement can lead to undesirable interactions with global statements such as the TITLE statement and ODS statements.

Your thoughts?

There is much more that can be said about these topics. What are your thoughts?

  • Are there unnecessary statements that you write out of habit?
  • Are there optional statements that you always include because it makes the program clearer?
Post a Comment

Color markers in a scatter plot by a third variable in SAS

One of my favorite new features in PROC SGPLOT in SAS 9.4m2 is addition of the COLORRESPONSE= and COLORMODEL= options to the SCATTER statement. By using these options, it is easy to color markers in a scatter plot so that the colors indicate the values of a continuous third variable.

If you are running an earlier release of SAS, you can use the Graph Template Language (GTL) to create a scatter plot where the markers are colored by a continuous response variable.

If you have a discrete variable (like gender), you can use the GROUP= option to color markers.

Color markers by a continuous response variable

You can use the COLORRESPONSE= option to visualize the values of a third variable by using colored markers in a scatter plot. For example, the following statements create a scatter plot of weight versus height for 19 students. Each marker is assigned a color that reflects the age of the student.

title "Markers Colored by Age";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        datalabel=Name;                            /* add labels to markers */
Markers colored by third variable

Click on the image for a larger version. By default, the plot uses a three-color gradient ramp. The smallest value of age (Joyce, age 11) is colored blue. The largest color (Phillip, age 16) is colored red. Markers that correspond to ages near the midrange (in this case, 13.5) are colored black. The gradient color ramp is shown on the right side of the plot. The plot shows that shorter and lighter students tend to be younger than taller and heavier students.

Use a different color ramp

The default color ramp is called ThreeColorAltRamp and is shown above. (The "Alt" version is suitable for use on a light-colored background.) You can change the color ramp by using the COLORMODEL= option in the SCATTER statement. The ODS graphics system provides several other two- and three-color color ramps. One is called the TwoColorRamp color ramp, which uses white as a color for low values and blue as a color for high values. Because the background of the graph is white, some markers will be hard to see unless you take additional actions. The following statements change the background color to a light gray and put a black outline around each marker so that an almost-white marker is still visible against the background:

title2 "Colormodel = TwoColorRamp";
proc sgplot data=sashelp.class;
styleattrs wallcolor=CXF0F0F0;                     /* light gray background */
scatter x=height y=weight / colorresponse=age 
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        filledoutlinedmarkers                      /* add outlines */
        colormodel=TwoColorRamp;                   /* white-blue ramp */
Color markers by two-color gradient ramp

Define a custom color ramp

You can also use the COLORMODEL= option to create a customer color ramp. There are multiple ways to specify colors in SAS, including English words and hexadecimal integers that begin with the prefix 'CX.' For example, you could create a four-color ramp by using the following COLORMODEL= option:

   colormodel=(blue green orange red);       /* custom 4-color ramp */

Be aware that not all color ramps are created equal. If your color ramp contains colors that vary a lot in brightness and intensity, the viewer's eye will be drawn towards certain markers and aware from others. This can bias the way that the viewer perceives the data.

I wrote a previous article that discussed making wise choices when you create a custom color ramp. In the following statements, I use colors from a six-color "spectral" color ramp in which all colors are perceptually equivalent. (I use the PALETTE function in SAS/IML to generate perceptually balanced color ramps.)

title2 "Custom Colormodel";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
   markerattrs=(symbol=CircleFilled size=14)
   colormodel=(CX3288BD CX99D594 CXE6F598 CXFEE08B CXFC8D59 CXD53E4F);
xaxis grid; yaxis grid;
Markers colored by custom gradient color ramp

Change the location of the gradient legend

By default the gradient legend appears on the right side of the plot. You can change the location and a few other attributes of the gradient legend by using the GRADIENTLEGEND statement in SAS 9.4m2. For example, if you want the gradient legend to appear at the bottom of the plot and want to modify the title for the legend, you add the following SGPLOT statement:

gradlegend / position=bottom title="Age (yrs)";

In SAS 9.4m3, the COLORRESPONSE= and COLORMODEL= options are not only available in the SCATTER statement, but also are available in other statements, including the DOT, HBAR, HIGHLOW, SERIES, VBAR, VECTOR, and WATERFALL statements.

Post a Comment

Absorbing Markov chains in SAS

Last week I showed how to represent a Markov transition matrix in the SAS/IML matrix language. I also showed how to use matrix multiplication to iterate a state vector, thereby producing a discrete-time forecast of the state of the Markov chain system. This article shows that the expected behavior of a Markov chain can often be determined just by performing linear algebraic operations on the transition matrix.

Absorbing Markov chains

Whereas the system in my previous article had four states, this article uses an example that has five states. The ith row of the following transition matrix gives the probabilities of transitioning from State i to any other state:

proc iml;
/* i_th row contains transition probabilities from State i */
P = { 0    0.5  0    0.5  0,
      0.5  0    0.5  0    0,
      0    0.5  0    0    0.5,
      0    0    0    1    0,
      0    0    0    0    1 };

For example, the last row of the matrix indicates that if the system is in State 5, the probability is 1 that it stays in State 5. This is the definition of an absorbing state. An absorbing state is common for many Markov chains in the life sciences. For example, if you are modeling how a population of cancer patients might respond to a treatment, possible states include remission, progression, or death. Death is an absorbing state because dead patients have probability 1 that they remain dead. The non-absorbing states are called transient. The current example has three transient states (1, 2, and 3) and two absorbing states (4 and 5).

If a Markov chain has an absorbing state and every initial state has a nonzero probability of transitioning to an absorbing state, then the chain is called an absorbing Markov chain. The Markov chain determined by the P matrix is absorbing. For an absorbing Markov chain, you can discover many statistical properties of the system just by using linear algebra. The formulas and examples used in this article are taken from the online textbook by Grimstead and Snell.

The first step for analyzing an absorbing chain is to permute the rows and columns of the transition matrix so that all of the transient states are listed first and the absorbing states are listed last. (The P matrix is already in this form.) If there are k absorbing states, the transition matrix in block form looks like the following:

Block form of an absorbing Markov transition matrix

The bottom right block of the transition matrix is a k x k identity matrix and represents the k absorbing states. The top left block contains the probabilities of transitioning between transient states. The upper right block contains the probabilities of transitioning from a transient state to an absorbing state. The lower left block contains zeros because there is no chance of transitioning from an absorbing state to any other state.

The following SAS/IML statements show how to extract the Q and R matrices from the P matrix:

k = sum(vecdiag(P)=1);      /* number of absorbing states */
nt = ncol(P) - k;           /* number of transient states */
Q = P[1:nt, 1:nt];          /* prob of transitions between transient states */ 
R = P[1:nt, nt+1:ncol(P)];  /* prob of transitions to absorbing state */

Expected behavior of absorbing Markov chains

By definition, all initial states for an absorbing system will eventually end up in one of the absorbing states. The following questions are of interest. If the system starts in the transient state i, then:

  1. What is the expected number of steps the system spends in transient state j?
  2. What is the expected number of steps before entering an absorbing state?
  3. What is the probability that the system will be absorbed into the jth absorbing state?

The answers to these questions are obtained by defining the so called fundamental matrix, which is N = (I-Q)-1. The fundamental matrix answers the first question because the entries of N are expected number of steps that the system spends in transient state j if it starts in transient state i:

transState = char(1:nt);        /* labels of transient states */
absState = char(nt+1:ncol(P));  /* labels of absorbing states */
/* Fundamental matrix gives the expected time that the system is 
   in transient state j if it starts in transient state i */
N = inv(I(nt) - Q);  
print N[L="Expected Time in State j" r=transState c=transState];
Expected time in State j for an absorbing Markov chain

The first row indicates that if the system starts in State 1, then on average it will spend 1.5 units of time in State 1 (including the initial time period), 1 unit of time in State 2, and 0.5 units of time in State 3 before transitioning to an absorbing state. Similarly, if the system starts in State 2, you can expect 1 unit, 2 units, and 1 unit of time in States 1, 2, and 3, respectively, before transitioning to an absorbing state.

Obviously, if you sum up the values for each row, you get the expect number of steps until the system transitions to an absorbing state:

/* Expected time before entering an absorbing state if the system
   starts in the transient state i */
t = N[,+];
print t[L="Expected Time until Absorption" r=transState];
Expected ime until absorption for an absorbing Markov chain

The remaining question is, to me, the most interesting. If the system starts in a transient state, you know that it will eventually transition into one of the k absorbing states. But which one? What is the probability that the system ends in the jth absorbing state if it starts in the ith transient state? The answer is obtained by the matrix multiplication N*R because the matrix N is the expected number of steps in each transient state and R contains the probability of transitioning from a transient state to an absorbing state. For our example, the computation is as follows:

/* The probability that an absorbing chain will be absorbed
   into j_th absorbing state if it starts in i_th transient state */
B = N*R;
print B[L="Absorption Probabilities" r=transState c=absState];
Absorption probabilities for an absorbing Markov chain with two absorbing states

The first row of the matrix indicates that if the system starts in State 1, it will end up in State 4 three quarters of the time and in State 5 one quarter of the time. The second rows indicates that if the system starts in State 2, there is a 50-50 chance that it will end up in State 4 or State 5.

Because this Markov chain is a stochastic process, you cannot say with certainty whether the system will eventually be absorbed into State 4 or State 5. However, starting the system in State 1 means that there is a higher probability that the final state will be State 4. Similarly, starting in State 3 means a higher probability that the final state will be in State 5.

There are many other statistics that you can compute for absorbing Markov chains. Refer to the references for additional computations.


Post a Comment

Break a sentence into words in SAS

Two of my favorite string-manipulation functions in the SAS DATA step are the COUNTW function and the SCAN function. The COUNTW function counts the number of words in a long string of text. Here "word" means a substring that is delimited by special characters, such as a space character, a period, or a comma. The SCAN function enables you to parse a long string and extract words. You can specify the delimiters yourself or use the default delimiters. Ron Cody discusses these and other string manipulation functions in his excellent 2005 tutorial, "An Introduction to SAS Character Functions."

Using the COUNTW and SCAN functions in the DATA step

For example, the following DATA step reads in a long line of text. The COUNTW function counts how many words are in the string. A loop then iterates over the number of words and the SCAN function extracts each word into a variable:

data parse;
length word $20;                 /* number of characters in the longest word */
input str $ 1-80;
delims = ' ,.!';                 /* delimiters: space, comma, period, ... */
numWords = countw(str, delims);  /* for each line of text, how many words? */
do i = 1 to numWords;            /* split text into words */
   word = scan(str, i, delims);
drop str delims i;
Introduction,to SAS/IML   programming!
Do you have... a question?
proc print data=parse;

Notice that the delimiters do not include the '/' or '?' characters. Therefore these characters are considered to be part of words. For example, the strings "SAS/IML" and "question?" include those non-letter characters. Notice also that consecutive delimiters are automatically excluded, such as extra spaces or the ellipses marks.

Creating a vector of words in SAS/IML

One of the advantages of the SAS/IML matrix language is that you can call the hundreds of functions in Base SAS. When you pass in a vector of arguments to a Base SAS function, the function returns a vector that is the same size and shape as the parameter. In this way, you can vectorize the calling of Base SAS functions. In particular, you can pass in a vector of indices to the SCAN function and get back a vector of words. You do not need to write a loop to extract multiple words, as the following example demonstrates:

proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!'; 
n = countw(s, delims);  
words = scan(s, 1:n, delims);  /* pass parameter vector: create vector of words */
print words;

In summary, Base SAS provides many useful functions such as the string manipulation functions. This article shows that when you call these functions from SAS/IML and pass in a parameter vector, you get back a vector of results.

Post a Comment

Markov transition matrices in SAS/IML

Many computations in elementary probability assume that the probability of an event is independent of previous trials. For example, if you toss a coin twice, the probability of observing "heads" on the second toss does not depend on the result of the first toss.

However, there are situations in which the current state of the system determines the probability of the next state. A stochastic process in which the probabilities depend on the current state is called a Markov chain.

A Markov transition matrix models the way that the system transitions between states. A transition matrix is a square matrix in which the (i,j)th element is the probability of transitioning from state i into state j. The sum of each row is 1. For reference, Markov chains and transition matrices are discussed in Chapter 11 of Grimstead and Snell's Introduction to Probability.

A transition matrix of probabilities

A Wikipedia article on Markov chains uses a sequence of coin flips to illustrate transitioning between states. This blog post uses the same example, which is described below.

Imagine a game in which you toss a fair coin until the sequence heads-tails-heads (HTH) appears. The process has the following four states:

  • State 1: No elements of the sequence are in order. If the next toss is tails (T), the system stays at State 1. If the next toss is heads (H), the system transition to State 2.
  • State 2: H. The first element of the sequence is in order. If the next toss is H, the system stays at State 2. If the next toss is T, the system transitions to State 3.
  • State 3: HT. The first two elements of the sequence in order. If the next toss is H, transition to State 4. If the next toss is T, transition to State 1.
  • State 4: HTH. The game is over. Stay in this state.

The transition matrix is given by the following SAS/IML matrix. The first row contains the transition probabilities for State 1, the second row contains the probabilities for State 2, and so on.

proc iml;
/* Transition matrix. Columns are next state; rows are current state */
/*     Null  H   HT  HTH */
P =    {0.5  0.5 0   0,   /* Null */
        0    0.5 0.5 0,   /* H    */
        0.5  0   0   0.5, /* HT   */
        0    0   0   1};  /* HTH  */
states = "1":"4";
print P[r=states c=states L="Transition Matrix"];

Analyzing sequences of coin tosses can be interesting and sometimes counterintuitive. Let's describe the expected behavior of this system.

The state vector

You can use a four-element row vector to represent the probabilities that the system is in each state. If the system is in the ith state, put a 1 in the ith element and zero elsewhere. Thus the state vector {1 0 0 0} indicates that the system is in State 1.

You can also use the state vector to represent probabilities. If the system has a 50% chance of being in State 1 and a 50% chance of being in State 2, the state of the system is represented by the state vector {0.5 0.5 0 0}.

The time-evolution of the system can be studied by multiplying the state vector and the transition matrix. If s is the current state of the system, then s*P gives the vector of probabilities for the next state of the system. Similarly, (s*P)*P = s*P2 describes the probabilities of the system being in each state after two tosses.

For the HTH-sequence game, suppose that you start a new game. The game starts in State 1. The following computation describes the evolution of the state vector:

s0 = {1 0 0 0};    /* a new game is in State 1 */
s1 = s0 * P;       /* probability distribution of states after 1 toss */
s2 = s1 * P;       /* probability distribution of states after 2 tosses */
print (s1//s2)[L="Prob of State" c=("1":"4") r={"1 toss" "2 tosses"}];
First two states in a Markov chain

The first row of the output gives the state probabilities for the system after one coin toss. The system will be in State 1 with probability 0.5 (if you toss T) and will be in State 2 with probability 0.5 (if you toss H). The second row is more interesting. The computation use the probabilities from the first toss to compute probabilities for the second toss. After two tosses, the probability is 0.25 for being in State 1 (TT), the probability is 0.5 for being in State 2 (TH or HH), and the probability is 0.25 for being in State 3 (HT).

Modeling a sequence of coin tosses

In general you can repeatedly multiple the state vector by the transition matrix to find the state probabilities after k time periods. For efficiency you should avoid concatenating results inside a loop. Instead, allocate a matrix and use the ith row to hold the ith state vector, as follows:

/* Iterate to see how the probability distribution evolves */
numTosses = 10;
s0 = {1 0 0 0};                     /* initial state */
s = j(numTosses+1, ncol(P), .);     /* allocate room for all tosses */
s[1,] = s0;                         /* store initial state */
do i = 2 to nrow(s);
   s[i,] = s[i-1,] * P;             /* i_th row = state after (i-1) iterations */
iteration = 0:numTosses;            /* iteration numbers */
print s[L="Prob of State" c=("1":"4") r=(char(iteration))];
First 10 states in a Markov chain with one absorbing state

The output shows the state probabilities for a sequence of 10 coin tosses. Recall that the last row of the transition matrix ensures that the sequence stays in State 4 after it reaches State 4. Therefore the probability of the system being in State 4 is nondecreasing. The output shows that there is a 65.6% chance that the sequence HTH will appear in 10 tosses or less.

You can visualize the evolution of the probability distributions by making a series plot for each column of this output. You can download the SAS program that creates the plot and contains all of the computations in this article. The plot is shown below:

Predicted states for a Markov chain by iterating a trasition matrix

The plot shows the probability distributions after each toss when the system starts in State 1. After three time periods the system can be in any state. Over the long term, the system has a high probability of being in State 4 and a negligible chance of being in the other three states.

Modeling transitions in a population

You can also apply the transition matrix to a population of games. For example, suppose that many students are playing the game. At a certain instant, 50% of the games are in State 1, 30% are in State 2, and 20% are in State 3. You can represent that population of initial states by using the initial vector

s0 = {0.5 0.3 0.2 0};

The following graph gives the probability of the states for the population for the next 30 coin tosses:

Predicted states for a Markov chain by iterating a trasition matrix

The initial portion of the graph looks different from the previous graph because the population starts in a different initial distribution of states. However, the long-term behavior of this system is the same: all games eventually end in State 4. For this initial population, the graph shows that you should expect 80% of the games to be in State 4 by the 13th toss.

A real example: Predicting caseloads for social workers and parole officers

An interesting application of using Markov chains was presented by Gongwei Chen at SAS Global Forum 2016. Chen built a discrete-time Markov chain model to forecast workloads at the Washington State Department of Corrections. Social workers and parole officers supervise convicted offenders who have been released from prison or who were sentenced to supervision by the court system. Individuals who exhibit good behavior can transition from a highly supervised situation into less supervision. Other individuals might commit a new offense that requires an increase in supervision. Chen used historical data to estimate the transition matrix between the different supervisory states. His model helps the State of Washington forecast the resources needed to supervise offenders.


In summary, it is easy to represent a transition matrix and a state vector in SAS/IML. You can iterate the initial distribution of states to forecast the state of the system after an arbitrary number of time periods. This is done by using matrix multiplication.

A Markov chain model involves iterating a linear dynamical system. The qualitative asymptotic behavior of such systems can be described by using the tools of linear algebra. In a future article, I will describe how you can compute statistical properties of Markov chain models from the transition matrix.

Post a Comment

Cantor sets, the devil's staircase, and probability

Last week I blogged about how to draw the Cantor function in SAS. The Cantor function is used in mathematics as a pathological example of a function that is constant almost everywhere yet somehow manages to "climb upwards," thus earning the nickname "the devil's staircase."

The Cantor function has three properties: (1) it is continuous, (2) it is non-decreasing, and (3) F(0)=0 and F(1)=1. There is a theorem that says that any function with these properties is the cumulative distribution function (CDF) for some random variable. In theory, therefore, you can generate a large number of random variates (a large sample), then use PROC UNIVARIATE to plot the empirical CDF. The empirical CDF should be the devil's staircase function.

A random sample from the Cantor set

My last article showed that every point in the Cantor set can be written as the infinite sum x = Σi ai3-i where the ai equal 0 or 2. You can approximate the Cantor set by truncating the series at a certain number of terms. You can generate random points from the (approximate) Cantor set by choosing the coefficients randomly from the Bernoulli distribution with probability 0.5. The following DATA step generates 10,000 points from the Cantor set by using random values for the first 20 coefficients:

data Cantor;
call streaminit(123456);
k = 20;             /* maximum number of coefficients */
n = 10000;          /* sample size to generate */
do i = 1 to n;      /* generate n random points in Cantor set */
   x = 0;
   do j = 1 to k;   /* each point is sum of powers of 1/3 */
      b = 2 * rand("Bernoulli", 0.5);  /* random value 0 or 2 */
      x + b * 3**(-j);                 /* power of 1/3 */
keep x;
proc univariate data=Cantor;
   cdf x;

The call to PROC UNIVARIATE displays the empirical CDF for the random sample of points. And voilà! The Cantor distribution function appears! As a bonus, the output from PROC UNIVARIATE (not shown) indicates that the mean of the distribution is 1/2 (which is obvious by symmetry) and the variance is 1/8, which you might not have expected.

This short SAS program does two things: it shows how to simulate a random sample uniformly from the (approximate) Cantor set, and it indicates that the devil's staircase function is the distribution function for this a uniform random variable.

A random sample in SAS/IML

Alternatively, you can generate the same random sample by using SAS/IML and matrix computations. My previous article drew the Cantor function by systematically using all combinations of 0s and 2s to construct the elements of the Cantor set. However, you can use the same matrix-based approach but generate random coefficients, therefore obtaining a random sample from the Cantor set:

proc iml;
k = 20;          /* maximum number of coefficients */
n = 10##4;       /* sample size to generate */
B = 2 * randfun(n||k, "Bernoulli", 0.5);  /* random n x k matrix of 0/2 */
v = 3##(-(1:k)); /* powers of 1/3 */
x = B * v`;      /* sum of random terms that are either 0*power or 2*power */

Recall that ECDF is a step function that plots the ith ordered datum at the height i/n. You can approximate the empirical CDF in SAS/IML by using the SERIES subroutine. Technically, the lines that appear in the line plot have a nonzero slope, but the approximation looks very similar to the PROC UNIVARIATE output:

call sort(x, 1);             /* order elements in the Cantor set */
x = 0 // x // 1;             /* append end points */
y = do(0, 1, 1/(nrow(x)-1)); /* empirical CDF */
title "Empirical Distribution of a Uniform Sample from the Cantor Set";
call series(x, y);

Interpretation in probability

It is interesting that you can actually define the Cantor set in terms of rolling a fair six-sided die. Suppose that you roll a die infinitely often and we adopt the following rules:

  • If the die shows a 1 or 2, you get zero points.
  • If the die shows a 3 or 4, you get one point.
  • If the die shows a 5 or 6, you get two points.

This game is closely related to the Cantor set. Recall that the Cantor set can be written as the set of all base-3 decimal values in [0,1] for which the decimal expansion does not contain a 1. In this game, each point in the Cantor set corresponds to a sequence of rolls that never contain a 3 or 4. (Equivalently, the score is always even.) Obviously this would never happen in real life, which is an intuitive way of saying that the Cantor set has "measure zero."


The first time I saw the Cantor function depicted as an empirical distribution function was when I saw a very compact MATLAB formula like this:

stairs([min(x) sort(x)],0:1/length(x):1) % Plot the c.d.f of x

The previous formula is equivalent to my SAS/IML program, but in my program I broke the formula apart so that the components could be understood more easily. This formula appears in a set of probability demos by Peter Doyle. I had the privilege to interact with Professor Doyle at The Geometry Center (U. MN) in the early 1990s, so perhaps he was responsible for showing me the Cantor distribution.

These UNC course notes from Jan Hannig also discuss the Cantor distribution, but the original source is not cited.

Post a Comment

Visualize the Cantor function in SAS


I was a freshman in college the first time I saw the Cantor middle-thirds set and the related Cantor "Devil's staircase" function. (Shown at left.) These constructions expanded my mind and led me to study fractals, real analysis, topology, and other mathematical areas.

The Cantor function and the Cantor middle-thirds set are often used as counter-examples to mathematical conjectures. The Cantor set is defined by a recursive process that requires infinitely many steps. However, you can approximate these pathological objects in a matrix-vector language such as SAS/IML with only a few lines of code!

Construction of the Cantor middle-thirds set

The Cantor middle-thirds set is defined by the following iterative algorithm. The algorithm starts with the closed interval [0,1], then does the following:

  1. In the first step, remove the open middle-thirds of the interval. The two remaining intervals are [0,1/3] and [2/3, 1], which each have length 1/3.
  2. In the ith step, remove the open middle-thirds of all intervals from the (i-1)th step. You now have twice as many intervals and each interval is 1/3 as long as the intervals in the previous step.
  3. Continue this process forever.

After two steps you have four intervals: [0,1/9], [2/9,1/3], [2/3, 7/9], and [8/9,1]. After three steps you have eight intervals of length 1/27. After k steps you have 2k intervals of length 3-k.

The Cantor set is the set of all points that never get removed during this infinite process. The Cantor set clearly contains infinitely many points because it contains the endpoints of the intervals that are removed: 0, 1, 1/3, 2/3, 1/9. 2/9, 7/9, 8/9, and so forth. Less intuitive is the fact that the cardinality of the Cantor set is uncountably infinite even though it is a set of measure zero.

Construction of the Cantor function

The Cantor function F: [0,1] → [0,1] can be defined iteratively in a way that reflects the construction of the Cantor middle-thirds set. The function is shown at the top of this article.

  • At step 0, define the function on the endpoints of the [0,1] interval by F(0)=0 and F(1)=1.
  • At step 1, define the function on the closed interval [1/3, 2/3] to be 1/2.
  • At step 2, define the function on [1/9, 2/9] to be 1/4. Define the function on [7/9, 8/9] to be 3/4.
  • Continue this construction. At each step, the function is defined on the closure of the middle-thirds intervals so that the function value is halfway between adjacent values. The function looks like a staircase with steps of difference heights and widths. The function is constant on every middle-thirds interval, but nevertheless is continuous and monotonically nondecreasing.

Visualizing the Cantor function in SAS

This is a SAS-related blog, so I want to visualize the Cantor function in SAS. The middle-third intervals during the kth step of the construction have length 3-k, so you can stop the construction after a small number of iterations and get a decent approximation. I'll use k=8 steps.

Although the Cantor set and function were defined geometrically, they have an equivalent definition in terms of decimal expansion. The Cantor set is the set of decimal values that can be written in base 3 without using the '1' digit. In other words, elements of the Cantor set have the form x = 0.a1a2a3... (base 3), where ai equals 0 or 2.

An equivalent definition in terms of fractions is x = Σi ai3-i where ai equals 0 or 2. Although the sum is infinite, you can approximate the Cantor set by truncating the series after finitely many terms. A sum like this can be expressed as an inner product x = a*v` where a is a k-element row vector that contains 0s and 2s and v is a vector that contains the elements {1/3, 1/9, 1/27, ..., 1/3-k}.

You can define B to be a matrix with k columns and 2k rows that contains all combinations of 0s and 2s. Then the matrix product B*v is an approximation to the Cantor set after k steps of the construction. It contains the right-hand endpoints of the middle-third intervals.

In SAS/IML you can use the EXPANDGRID function to create a matrix whose rows contain all combinations of 0s and 2s. The ## operator raises an element to a power. Therefore the following statements construct and visualize the Cantor function. With a little more effort, you can write a few more statements that improve the approximation and add fractional tick marks to the axes, as shown in the graph at the top of this article.

proc iml;
/* rows of B contain all 8-digit combinations of 0s and 2s */ 
B = expandgrid({0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2});
B = B[2:nrow(B),];      /* remove first row of zeros */
k = ncol(B);            /* k = 8 */
v = 3##(-(1:k));       /* vector of powers 3^{-i} */
t = B * v`;            /* x values: right-hand endpts of middle-third intervals */
u = 2##(-(1:k));       /* vector of powers 2^{-i} */
f = B/2 * u`;          /* y values: Cantor function on Cantor set */
call series(t, f);     /* graph the Cantor function */

I think this is a very cool construction. Although the Cantor function is defined iteratively, there are no loops in this program. The loops are replaced by matrix multiplication and vectors. The power of a matrix language is that it enables you to compute complex quantities with only a few lines of programming.

Do you have a favorite example from math or statistics that changed the way that you look at the world? Leave a comment.


This short article cannot discuss all the mathematically interesting features of the Cantor set and Cantor function. The following references are provided for the readers who want additional information:

Post a Comment