The SELECT statement in the SAS DATA step

Every beginning SAS programmer learns the simple IF-THEN/ELSE statement for conditional processing in the SAS DATA step. The basic If-THEN statement handles two cases: if a condition is true, the program does one thing, otherwise the program does something else.

Of course, you can handle more cases by using multiple ELSE IF statements. I have seen SAS programs that contain contains dozens of ELSE clauses. Sometimes a long sequence of IF-THEN/ELSE statements is necessary, such as when you are testing complex logical conditions.

Flow chart diagram for a switch statement (SELECT statement)

An alternative control statement in SAS is the SELECT-WHEN statement. The SELECT-WHEN statement (sometimes simply called the SELECT statement) enables you to conditionally execute statements based on the value of a single categorical variable. Usually the variable can have three or more valid values that you want to handle.

The following example uses the Sashelp.Heart data set, which contains data for 5,167 patients in a medical study. The Smoking_Status variable is a categorical variable that encodes the average number of cigarettes that each patient smokes per day. The following DATA step view implements a recoding scheme, which is sometimes the easiest way to force levels of a nominal variable to appear in a natural order during a SAS analysis.

/* example of using the SELECT statement */
data Heart / view=Heart;
set sashelp.heart;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   otherwise                  Smoking_Cat=.;
end;
run;

The SELECT-WHEN statement is easy to read. You specify the name of a variable on the SELECT statement. You then list a sequence of WHEN statements. Each WHEN statement specifies a particular value for the variable. If the variable has that value, the program conditionally executes a statement, which in this example assigns a value to the Smoking_Cat variable.

Notice that you can use the OTHERWISE keyword to handle missing values, invalid data, or default actions.

You can also combine categories in a WHEN statement. For example, in a statistical analysis you might want to combine the 'Heavy' and 'Very Heavy' categories into a single group. In the WHEN statement you can specify multiple values in a comma-separated list:

   /* combine the 'Heavy' and 'Very Heavy' categories */
   when ('Heavy (16-25)', 'Very Heavy (> 25)') Smoking_Cat=4;

If the WHEN condition is true, the program will execute one statement. This is the same rule that the IF-THEN statement follows. To execute more than one statement, use a DO-END block, which groups statements together:

   when ('Non-smoker') do;       /* execute multiple statements */
      Smoking_Cat=1;
      IsSmoker = 0;
   end;

I use the SELECT-WHEN statement as a "table lookup" when a program needs to branch according to the value of a single categorical variable that has three or more valid values. The basic SELECT-WHEN statement is not as flexible as the IF-THEN/ELSE statement, but, when applicable, it results in very clean and easy-to-read programs.

Other languages have similar branching statements. The SQL language supports a CASE-WHEN statement. The C/C++ and Java/Javascript languages support a switch-case statement. Whereas the CASE-WHEN statement in SAS executes one statement, the switch-case statement implements fallthrough, so C-programmers often use the break statement to exit the switch block.

Some languages do not support a special switch statement, but instead require that you use IF-THEN/ELSE statements. Python and the SAS/IML language fall into this category.

There is an alternative syntax for the SELECT-WHEN statement that does not specify an expression in the SELECT statement. Instead, you specify logical conditions in the WHEN statements. This alternate syntax is essentially equivalent to an IF-THEN/ELSE statement, so which syntax you use is a matter of personal preference. Personally, I use SELECT-WHEN for branching on a known set of discrete values, and I use the IF-THEN/ELSE statement to handle more complex situations.

Post a Comment

Overlay plots on a box plot in SAS: Continuous X axis

I have previously shown how to overlay basic plots on box plots when all plots share a common discrete X axis. It is interesting to note that box plots can also be overlaid on a continuous (interval) axis. You often need to bin the data before you create the plot.

A typical situation when you plot a time series. You might want to overlay box plots to display a summary of the response distribution within certain time intervals. For example, last week I visualized World Bank data for the average life expectancy in more than 200 countries during the years 1960–2014. Suppose that for each decade you want to draw a box plot that summarizes the response variable for all countries. You can use the DATA step (or create a data view) to create the Decade variable, as follows:

data Decade / view=decade;
set LE;
if      1960 <= Year <= 1969 then Decade=1965;
else if 1970 <= Year <= 1979 then Decade=1975;
else if 1980 <= Year <= 1989 then Decade=1985;
else if 1990 <= Year <= 1999 then Decade=1995;
else if 2000 <= Year <= 2009 then Decade=2005;
else if 2010 <= Year         then Decade=2015;
else Decade = .;   /* handle bad data */
run;

You can use the following call to PROC SGPLOT to overlay the box plot and the scatter plot of the data:

/* overlay data by year and box plots to summarize decades */
title "Distribution of Life Expectancy by Decade";
title2 "Raw Data for 207 Countries by Year";
proc sgplot data=Decade noautolegend;
   vbox Expected / category=Decade;              /* box plot */
   scatter x=year y=Expected / transparency=0.9; /* semitransparent scatter */
   xaxis type=linear display=(nolabel)           /* linear axis */
         values=(1965 to 2015 by 10)             /* optional: set ticks labels */
         valuesdisplay=('60s' '70s' '80s' '90s' '2000s' '2010s');
run;
Box plots by decade overlaid with life expectancy data by year

There are a couple of tricks that make this overlay work:

  • The discrete Decade variable uses the same scale as the continuous Year variable. In fact, the Decade value is in the middle of the continuous interval for each decade.
  • The scatter plot markers are highly transparent so that they show the individual measurements without overwhelming the display.
  • The TYPE=LINEAR option in the XAXIS statement enables you to overlay the scatter plot which has an interval X axis) and the box plot (which has discrete X values).
  • Optionally, you can specify the location of the major tick marks by using the VALUES= option in the XAXIS statement. In this example, the tick marks are set to be 1965, 1975, and so forth, which are the values of the Decade variable. To emphasize the decades, rather than particular years, you can use the VALUESDIPLAY= option to manually set the values that are displayed for each tick mark.

Overlay regression lines on box plots

In my previous article I showed how you can use the CONNECT= option to connect quantiles of adjacent box plots. Some people use the CONNECT= option as a poor man's version of quantile regression. However, quantile regression is more than merely connecting the sample quantiles of binned data. Quantile regression shows trends for various quantiles of the response variable.

Box plots by decade; overlay quantile regression of life expectancy

For brevity, I will not discuss how to use PROC QUANTREG to perform quantile regression on these data. However, you can download the program that performs the analysis and creates the graph.

The graph overlays quantile regression lines on the previous graph. Notice that the 90th and 75th quantiles of life expectancy are increasing at a rate of about 2 years per decade. The median life expectancy is increasing at a faster rate of 3 years per decade. The lower quantiles are increasing even faster: the 25th quantile is increasing at about 4 years per decade and the 10th quantile is increasing at about 3.6 years per decade. Notice that the 25th, 50th, and 75th quantile curves are close to (but not equal to) the corresponding features of the aggregated box plots.

I think that this graph provides an excellent visualization of trends in life expectancy around the world, especially if combined with spaghetti plots or lasagna plots. The countries at the top of the plot have good sanitation, health care, and nutrition. Consequently, their life expectancy increases at a slower rate than countries that have poorer sanitation and nutrition. Small improvements in the poorer countries can make a big difference in the life expectancy of their population.

Summary

In summary, you can overlay continuous plots and box plots in SAS 9.4m1. You need to ensure that the categories of the box plots are on the same scale as the continuous X variable. You need to set the TYPE=LINEAR option on the XAXIS statement. Lastly, you might want to use partial transparency to ensure that the graph doesn't become too crowded.

Post a Comment

Overlay plots on a box plot in SAS: Discrete X axis

Box plots summarize the distribution of a continuous variable. You can display multiple box plots in a single graph by specifying a categorical variable. The resulting graph shows the distribution of subpopulations, such as different experimental groups. In the SGPLOT procedure, you can use the CATEGORY= option on the VBOX statement to generate box plots for each level of a categorical variable.

Sometimes you need to overlay additional points or lines on box plots. SAS 9.4M1 and beyond supports overlaying "basic plots" and box plots. There are many different basic plot types, including the scatter plot and the series (line) plot.

This article shows examples when the X axis is discrete. In a subsequent article I will discuss the case of a continuous X axis.

Connect features of adjacent box plots

The simplest way to overlay features on a set of box plots is to use the CONNECT= option on the VBOX statement. The CONNECT= option (which also requires SAS 9.4M1) enables you to overlay line segments that connect the means or selected quantiles (max, min, or quartiles) of adjacent box plots. The article by Sanjay Matange provides details about how to use the CONNECT= option. The CONNECT= option is useful when you want to visually emphasize how the mean (or quartiles) change between levels of a classification variable.

You can use a trick to add multiple CONNECT= options. For example, if you want to connect the first, second, and third quartile values, you can repeat the VBOX statement multiple times, but use the NOFILL and NOOUTLIERS options on all but the first statement. The following statement provides an example. Notice that the NOCYCLEATTRS option prevents the plot from using different colors for the different VBOX statements:

title "Fuel Efficiency by Origin";
proc sgplot data=sashelp.cars noautolegend nocycleattrs;      /* suppress cycling attributes */
vbox mpg_city / category=origin connect=median;               /* draw box and connect medians */
vbox mpg_city / category=origin nofill nooutliers connect=Q1; /* connect Q1 */
vbox mpg_city / category=origin nofill nooutliers connect=Q3; /* connect Q3 */
run;
Box plots with lines connecting the Q1, median, and Q3 quartiles

Overlay arbitrary line segments and points on box plots

In the same article, Sanjay shows how to overlay line segments that connect any precomputed quantities. For example, you can use PROC MEANS to compute statistics for each category and then use one or more SERIES statements to display line segments that connect the statistics.

In a similar way you can overlay markers that represent special observations or reference values. For example, suppose the Sashelp.Class data set contains data about the heights and weights of a teacher's current students. When two new foreign exchange students (Ivan and Anna) are assigned to her class, she notices that these students are exceptionally tall compared to her current students. She decides to overlay the new student's heights on box plots (one for each gender) that show the height distributions of her current students. One way to do this is to create a box plot of the original data and then overlay a scatter plot of the new observations.

The following SAS program

  1. Creates a data set with the new data.
  2. Concatenates the original and the new data. To overlay the plots they should have a common X axis. In this case, the new data can share the Name, Sex, and Age variables of the original data, but you should create NEW variables that describe the heights and weights of the new students. In this way only the original observations are used to construct the box plots and only the new observations are used to overlay the scatter plot.
  3. Uses the SGPLOT procedure to overlay the box plots and the scatter plot. Use the VBOX statement with CATEGORY=Sex to create the box plot and use the SCATTER statement with X=Sex to overlay the scatter plot.
data NewStudents;
length Name $8 Sex $1;
input Name Sex Age Height Weight;
datalines;
Ivan    M 14 71.5 110 
Anna    F 13 68   105
;
 
data All;          /* concatenate: use same X variable but different Y variables */
set sashelp.class
    NewStudents(rename=(Height=NewHeight Weight=NewWeight));
run;
 
proc sgplot data=All noautolegend;
vbox height / category=sex;
scatter x=sex y=NewHeight / datalabel=Name
        markerattrs=(color=red symbol=CircleFilled size=12);
run;
Box plots with markers that indicated special values

Overlay multiple plots on box plots

It is just as easy to overlay multiple plots. Recall that plots are rendered in the order that they appear in the SGPLOT procedure. This means that plots that change the background (such as a BLOCK plot) should be specified first, and plots that are use markers should be specified last. Another option is to use the TRANSPARENCY= option judiciously so that features in earlier plots are visible even though other plots are overlaid on top.

The following example demonstrates overlaying three plots. The plot shows the distribution of the fuel economy (measured as miles per gallon) for different kinds of vehicles. A block plot in the background emphasizes that hybrid vehicles have the best fuel economy. The box plot then shows the distribution of the fuel economy for six types of vehicles. The BOXWIDTH= option is used to control the width of the box plots. Lastly, the graph displays a scatter plot of the response variable for each model of vehicle. The JITTER option is used to reduce overplotting.

data cars / view=cars;
set Sashelp.Cars;
if type="Hybrid" then FuelType = "Hybrid      ";
else FuelType = "Conventional";
run;
 
proc sgplot data=cars noautolegend;
block x=type block=FuelType / filltype=alternate 
         fillattrs=(color=LightGray)altfillattrs=(color=white);
vbox mpg_city / category=type boxwidth=0.8 nooutliers;
scatter  x=type y=mpg_city / jitter transparency=0.6 
         markerattrs=(color=red symbol=CircleFilled);
yaxis offsetmax=0.1;
run;
boxdiscrete3

The resulting graph looks complicated, but is created by using only a few statements. Again, the key is that each of the three plot types share a common, discrete, X variable.

This article provides several examples of overlaying a box plot with other plots that share a discrete X axis. In my next blog post I will discuss how you can use box plots to summarize conditional distributions when the X axis is continuous.

Post a Comment

Lasagna plots in SAS: When spaghetti plots don't suffice

Last week I discussed how to create spaghetti plots in SAS. A spaghetti plot is a type of line plot that contains many lines. Spaghetti plots are used in longitudinal studies to show trends among individual subjects, which can be patients, hospitals, companies, states, or countries. I showed ways to ease overplotting in spaghetti plots, but ultimately the plots live up to their names: When you display many individual curves the plot becomes a tangled heap of noodles that is hard to digest.

Lasagna plot of life expectancy

An alternative is to use a heat map instead of a line plot, as shown to the left. (This graph is created later in this article.) Each row of the heat map represents a subject. Each column represents a time point. Heat maps are useful when a response variable is recorded for every individual at the same set of uniformly spaced time points, such as daily, monthly, or yearly.

In a cleverly titled paper, Swihart et al. (2010) proposed the name "lasagna plot" to denote a heat map that visualizes longitudinal data. Whereas the spaghetti plot allows "noodles" (the individual curves) to cross, a lasagna plot layers the noodles horizontally, one on top of another. A related visualization is a calendar chart (Mintz, Fitz-Simons, and Wayland (1997); Allison and Zdeb (2005)), which also uses colored tiles to convey information about a response variable as a function of time.

This article shows how to create lasagna plots in SAS. To create a lasagna plot in SAS you can:

You can download the SAS program used to create all the graphs in this article.

Create a basic lasagna plot in SAS

In a previous article I showed how to download the World Bank data for the average life expectancy in more than 200 countries during the years 1960–2014. After downloading the data, the data are transformed from "wide form" into "long form." The following call to PROC SGPLOT creates a spaghetti plot of the "Low Income" countries.

ods graphics / imagemap=ON;   /* enable data tips */
title "Life Expectancy at Birth";
title2 "Low-Income Countries";
proc sgplot data=LE;          /* create conventional spaghetti plot */
   where income=5;            /* extract the "low income" companies */
   format Country_Name $10.;  /* truncate country names */
   series x=Year y=Expected / group=Country_name break curvelabel
       lineattrs=(pattern=solid) tip=(Country_Name Region Year Expected);
run;
Spaghetti plot of life expectancy

This spaghetti plot is not very enlightening. There are 31 curves in the graph, although discovering that number from the graph is not easy. The labels that identify the curves overlap and are impossible to read. You can see some trends, such as the fact that life expectancy has, on average, increased for these countries. You can also see some interesting features. Cambodia (a reddish color) experienced a dip in the 1970s. Rwanda (purple) and Sierra Leone (gold) experienced dips in the 1990s. Zimbabwe (light blue) experienced a big decline in the 2000s.

A lasagna plot visualizes these data more effectively. The following statements use the HEATMAP statement in PROC SGPLOT, which requires SAS 9.40M3:

title "Life Expectancy in Low Income Countries";
/* 1. Unsorted list of low-income countries */
ods graphics/ width=500px height=600px discretemax=10000;
proc sgplot data=LE;
   where Income=5;            /* extract the "low income" companies */
   format Country_Name $10.;  /* truncate country names */
   heatmap x=Year y=Country_Name/ colorresponse=Expected discretex
               colormodel=TwoColorRamp;
   yaxis display=(nolabel) labelattrs=(size=6pt) reverse;
   xaxis display=(nolabel) labelattrs=(size=8pt) fitpolicy=thin;
run;

The graph is shown at the top of this article. Each row is a country; each column is a year. The default two-color color ramp encodes the value of the response variable, which is the average life expectancy. There are 31 x 55 = 1705 tiles, so the image displays a lot of information without any overplotting. You can use the COLORMODEL= option on the HEATMAP statement to specify a custom color ramp.

Many rows have a light shade on the left and a darker shade to the right, which confirms the general upward trend of the response variable. Countries that experienced a period of decline in life expectancy (Cambodia, Rwanda, and Zimbabwe) have a region of white or light blue in the middle of the row. In some countries the life expectancy has been consistently low (Chad and Guinea-Bissau); in others it has been consistently high (Dem. People's Republic of Korea).

The lasagna plot is not perfect. Because the graph avoids overplotting, you need a lot of space to display the rows. This lasagna plot uses 600 vertical pixels to display 31 countries. That's about 16 pixels for each row after you subtract the space above and below the heat map. If you use a smaller font, you can reduce that to 10 or 12 pixels per row. However, even at 12 pixels per row you would need about 2500 pixels in the vertical direction to display all 207 countries in the data set. In contrast, the spaghetti plot displays an arbitrary number of (possibly undecipherable) lines in a smaller area.

Sorting rows of a lasagna plot

Alphabetical ordering is often not the most informative way to display the levels of a categorical variable. You can sort the countries in various ways: by the mean life expectancy, by the average rate of increase (slope), by geographical region, and so forth.

To demonstrate sorting, the following program uses the HEATMAPCONT subroutine in the SAS/IML language. The following statements read in data for 51 lower-middle income countries into a SAS/IML matrix. The statements read the original "wide" data whereas PROC SGPLOT required "long" data. Use the PALETTE function to create a custom color ramp for the heat map.

ods graphics / width=500px height=650px;
proc iml;
varName = "Y1960":"Y2014";
use LL2 where (Income=4);               /* read "lower-middle income" countries */
   read all var varName into X[rowname=Country_Name]; /* X = "wide" data matrix */
close LL2;
 
Names = putc(Country_Name, "$15.");            /* truncate names */
palette = "CXFFFFFF" || palette("YLORRD", 4);  /* palette from colorbrewer.org */
 
/* 2. Order rows by average life expectancy from 1960-2014 */
mean = X[,:];                   /* compute mean for each row */
call sortndx(idx, mean, 1, 1);  /* sort by first column, descending */
Sort1 = X[idx,];                /* use sorted data */
Names1 = Names[idx,];           /* and sorted names */
call heatmapcont(Sort1) xvalues=1960:2014 yvalues=Names1
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Sorted by Average";
Lasagna plot of life expectancy. Created with SAS/IML.

The lasagna plot show the life expectancy for 51 countries. The countries are sorted according to mean life expectancy over the 55 years in the data.

The sorted visualization is superior to the unsorted version. You can easily pick out the countries that have the top life expectancy, such as former Soviet Union countries. you can easily see that the countries at the bottom of the list are mainly in Western and Southern Africa.

The countries that experienced dips in life expectancy contain a patch of white or pale yellow in the middle of the row. Zambia, Kenya, and Lesotho stand out. Countries that have dramatically improved their life expectancy are also evident. For example, the rows for Bhutan and Timor-Leste are white or yellow on the left and dark orange on the right, which indicates that life expectancy has greatly improved in the past 55 years for these countries.

In this heat map, missing values are assigned a gray color. Only two countries (West Bank and Gaza, Kosovo) have missing values for life expectancy.

Other ways to sort lasagna plots

Swihart et al (2010) discuss other sorting techniques. One alternate display is to sort each column independently. This is similar to displaying a sequence of box plots, one for each year, because it shows the distribution of the response variable for each year, aggregated over all countries. This display is accomplished by using the following SAS/IML statements to sort each column in the data matrix:

/* 3. Order each year to see distribution of life expectancy for each year */
Sort2 = X;                   /* copy original data */
do i = 1 to ncol(X);
   v = X[,i];                /* extract i_th column */
   call sort(v, 1, 1);       /* sort i_th column descending */
   Sort2[,i] = v;            /* put sorted column into matrix */ 
end;
call heatmapcont(Sort2) xvalues=1960:2014
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Sorted for each Year";
Lasagna plot sorted by mean response. Created with SAS/IML.

In this graph, each tile still represents a country, but you don't know which country. The purpose of the graph is to show how the distribution of the response variable has changed over time. You can see that in 1960 most countries had a life expectancy of 55 years or less, as shown by the vertical strip of mostly yellow tiles for 1960. Only one country in 1960 had an average life expectancy that approached 70 years (red).

Decades later, the vertical strips contain fewer yellow tiles and more orange and red tiles. The strip for 1990 contains mainly orange or dark orange tiles, which indicates that most countries have an average life expectancy of 55 years or greater. For 2014, more than half of the vertical strip is dark orange or red, which indicates that most countries have an average life expectancy of 65 years or greater.

Summary

A "lasagna plot" is a heat map in which each row represents a subject such as a country, a patient, or a company. The columns usually represent time, and the colors for each tile indicate the value of a response variable. The lasagna plot is useful when the response is measured for all subjects at the same set of uniformly spaced time points.

In contrast to spaghetti plots, there is no overplotting in a lasagna plot. However, each subject requires some minimal number of vertical pixels (usually 10 or more) so lasagna plots are not suitable for visualizing thousands of subjects. Lasagna plots are most effective when you sort the subjects by some characteristic, such as the mean response value.

Much more can be said about lasagna plots. I recommend the paper "A Method for Visualizing Multivariate Time Series Data," (Peng, JSS, 2008), which discusses techniques for standardizing the data, smoothing the data, and discretizing the response variable. Peng's article includes panel displays, which can be created in SAS by using the GTL and PROC SGRENDER.

Post a Comment

How to write CONTRAST and ESTIMATE statements in SAS regression procedures

I got several positive comments about a recent tip, "How to fit a variety of logistic regression models in SAS." A reader asked if I knew any other similar resources about statistical analysis in SAS.

Absolutely! One gem that comes to mind is "Examples of writing CONTRAST and ESTIMATE statements." SAS statistical programmers often ask how to write CONTRAST and ESTIMATE statements on discussion forums such as the SAS Support Community for Statistical Procedures.

The Knowledge Base article features regression models that you might encounter in PROC GLM, PROC LOGISTIC, and PROC GENMOD. The article includes the following topics:

  • How to express certain hypotheses as linear combinations of regression coefficients.
  • Why you must know the order of parameters for classification variables to properly write CONTRAST and ESTIMATE statements.
  • How to write CONTRAST and ESTIMATE statements for interaction effects.
  • How to specify linear combinations that include fractions like 1/3 or 1/6 that cannot be expressed as a terminating decimal value.
  • How to estimate or test contrasts of log odds in logistic models that use either GLM or EFFECT (deviation from the mean) encodings.
  • How to use the CONTRAST statement to compare nested models.

The article is written for a technical audience, and the examples are complex. However, if you are statistically sophisticated analyst, then "Examples of Writing CONTRAST and ESTIMATE Statements" is an excellent tutorial. Bookmark this page in case you need it!

And if you still have questions after reading the article, remember that the SAS Support Community is just a click away.

Post a Comment

Create spaghetti plots in SAS

What is a spaghetti plot? Spaghetti plots are line plots that involve many overlapping lines. Like spaghetti on your plate, they can be hard to unravel, yet for many analysts they are a delicious staple of data visualization. This article presents the good, the bad, and the messy about spaghetti plots and shows how to create basic and advanced spaghetti plots in SAS.

The data set in this article contains World Bank data about the average life expectancy (at birth) for more than 200 countries. The data are for the years 1960–2014. For convenience, I have attached two CSV files, one that contains the life expectancy data and another that contains country information. You can also download the SAS file that creates all the graphs in this article.

Line charts versus spaghetti charts

A line plot displays a continuous response variable at multiple time points. The response variable might be the price of a stock, the temperature in a city, or the blood pressure of a patient. When there are only a handful of stocks, cities, or patients, you can display multiple lines on the same plot and use labels, colors, or patterns to distinguish the individual units. For example, the following call to PROC SGPLOT creates a line plot of the life expectancy at birth for 10 countries, plotted for the years 1960–2014:

title "Life Expectancy at Birth";
proc sgplot data=LE;
   where country_name in ("China" "Chad" "Croatia" "Israel"
      "Kosovo" "Kuwait" "India" "Peru" "Sudan" "United States");
   series x=Year y=Expected / group=Country_name break curvelabel;
run;
Spaghetti plot: Life expectancy in 10 countries versus time

The line plot enables you to easily track the rise and fall of life expectancy over time for each of these 10 countries. The labels for Kuwait and Peru overlap, but otherwise the line plot is easy to read and interpret. You can see the general trend, which is increasing life expectancy for all countries. You can see that richer nations tend to have higher life expectancy than poorer nations. You can also see that Israel is missing several data points in the early 1960s, and that Kosovo is missing data prior to 1981.

The fact that the labels for Peru and Kuwait overlap is sign of that the line plot is starting to transition to a spaghetti plot. As you increase the number of curves in a line plot, more labels will overlap and it becomes harder to distinguish colors and to trace a country's curve from beginning to end. By the time you plot 40 or 50 countries, the time series plot has become a tangled mess that earns the moniker "spaghetti plot."

Serving up a spaghetti plot in SAS

When there are many individual countries (or stocks or patients), the line plot no longer reveals the behavior of each individual unit, but still can reveal trends. If the units can be classified into a small number of categories, colors can be used to visualize whether trends differ between groups. For example, patients might be grouped into cohorts, such as males versus females or control versus experimental groups.

For the life expectancy data, an Income variable records the relative wealth of each country. The following call to PROC SGPLOT creates a spaghetti plot that contains lines for 207 countries.

title "Life Expectancy at Birth for 207 Countries";
proc sgplot data=LE;
   series x=Year y=Expected / group=Country_Name grouplc=Income break 
        transparency=0.7 lineattrs=(pattern=solid)
        tip=(Country_Name Income Region);
   xaxis display=(nolabel);
   keylegend / type=linecolor title="";
run;
Spaghetti plot: Life expectancy in 207 countries versus time, colored by income categories

The 207 countries are classified into five groups according to economic factors. Thirty-two nations are wealthy and belong to the Organization for Economic Co-operation and Development (OECD). Forty-two wealthy countries are not OECD members. Fifty-one countries are upper-middle income, 51 are lower-middle income, and 31 are low income. Those categories are used to assign colors to the curves. The GROUPLC= option (available in SAS 9.4M2) is used to color the lines by the five levels of the Income variable.

Transparency is used to reduce overplotting. The TIP= option is used to add tooltips that display information about a country when you hover the mouse pointer over the HTML version of the graph in SAS. Notice also that BREAK option is used to break a line if there is a missing value; otherwise the line segments will connect across missing values.

Although the spaghetti plot contains about 207*54 ≈ 11,000 line segments, you can still deduce certain trends. You can see that the highest life expectancy belongs to OECD nations whereas the lowest belongs to low-income nations. You can see that the average trend is increasing, and that the slope looks greater for the low-income nations. You can see that some curves had negative slopes in the 1980s and 1990s, and that several had precarious dips. By using the tool tips, you can identify Cambodia (1970s) and Rwanda (1990s) as two countries whose life expectancy dipped to alarmingly low levels.

Paneled spaghetti plots

In spite of revealing general trends, it is hard to identify individual countries. The problem is that there are too many overlapping curves. This is a weakness of spaghetti plots: they do not reveal the identity of individuals, except for extreme outliers. If you see an interesting feature of some curve in the middle of the plot, you will have a hard time figuring out which country it is.

One possible alternative is draw the curves for each category in a separate panel rather than overlaying the categories. This reduces the number of curves in any one plot. For the World Bank data, you can use the BY statement in PROC SGPLOT to create full-sized plots of each Income level, or you can use the SGPANEL procedure to create five cells, each with 30 to 50 curves. Again, you can use transparency and tool tips to help discern individual "noodles" among the five spaghetti plots. If the data has additional categories, you can color the line segments in each panel. The following statement create a panel of spaghetti plots where each plot is now colored by a categorical variable (Region) that encodes the country's geographic region.

proc sgpanel data=LE;
   panelby Income / columns=3 onepanel sparse;
   series x=Year y=Expected / group=Country_Name break transparency=0.5
       grouplc=region lineattrs=(pattern=solid)
       tip=(Country_Name Region);
   colaxis grid display=(nolabel) offsetmin=0.05 fitpolicy=stagger;
   keylegend / type=linecolor title="";
run;
Panel of spaghetti plots. Panels indicate income of countries, line plots show life expectancy versus time for countries, colored by geographic region

The panels show that the "High OECD" countries are primarily European countries. The upper- and lower-middle income panels are a mixture of countries from diverse geographic locations. The "Low Income" countries are primarily in sub-Saharan Africa, with a few in South Asia.

Because each panel contains fewer curves, it is easier to identify the outliers. For example, the life expectancy in Equatorial Guinea is low relative to other high-income non-OECD countries. Angola has a low life expectancy relative to other upper-middle-income countries. North Korea reports high life expectancy relative to other low-income countries.

By using the tool tips, you can discover the names of countries that experienced drops in life expectancy due to conflict or environmental disasters. You can display additional variables in the tool tips.

Alternative displays

In summary, the spaghetti plot shows major trends over time. If you color by a categorical variable, you can visually compare cohorts. However, the spaghetti plot has some problems. When the number of curves becomes large (typically more than 20 or 30), it becomes difficult to distinguish individual curves. Although using semi-transparent lines helps to reduce overplotting, the plot becomes virtually unreadable when you have 200 or more curves.

There are some modifications that you use to handle overplotting. This article shows a paneling approach, in which cohorts are plotted in separate panels.

Another option is to increase the interactivity of the plot. The Flowing Data website has a nice interactive version of these data. In SAS you could use interactive software such as JMP or SAS/IML Studio to create a similar interactive plot for exploratory data analysis.

Some visualization experts choose a different visualization for multiple time series. Instead of plotting a response versus time, you can show changes in time by using an animation. For example, Hans Rosling famously used an animated bubble plot to show the relationship between life expectancy and average income over time. Sanjay Matange showed how to use PROC SGPLOT to create Rosling's animated bubble plot in SAS.

For an alternative visualization of this time series data, you might consider lasagna plots.

Post a Comment

Grids and linear subspaces

parallelepiped

A grid is a set of evenly spaced points. You can use SAS to create a grid of points on an interval, in a rectangular region in the plane, or even in higher-dimensional regions like the parallelepiped shown at the left, which is generated by three vectors. You can use vectors, matrices, and functions in the SAS/IML language to obtain a grid in a parallelepiped.

Evenly spaced points in an interval

In one dimension, you can use a DO loop in the SAS DATA step to construct evenly spaced points in an interval. In the SAS/IML language, you can construct a vector of evenly spaced values operation without writing a DO loop. You can use the DO function, or you can use the colon operator (which SAS/IML documentation calls the "index creation operator"). The results are equivalent, as shown in the following program:

proc iml;
a = 2; b = 5; N = 5;
delta = (b-a) / N;
y = do(a, b, delta);    /* Method 1: DO function: start=1, stop=b, step=delta */
x = a +  delta*(0:N);   /* Method 2: Use colon operator and vector sum */
print y, x;
span2d1

Subdivide a vector subspace

The second method generalizes to any linear subspace. For example, suppose you are given a vector u with magnitude α = ||u||. Then the following SAS/IML statements create points that are evenly spaced points along the vector u. The magnitudes of the N vectors are 0, α/N, 2α/N, and so forth up to α.

/* linear spacing along a vector */
u = {3 2 1};
delta = u / N;         /* Note: delta is vector */
w = delta @ T(0:N);    /* i_th row is (i-1)*u */
print w;
span2d2

It's always exciting to use the Kronecker product operator (@). The Kronecker product operator multiplies delta by 0, 1, 2, ..., N and stacks the results. Notice that the ith column of the w matrix is a linear progression from 0 to the ith component of u. Each row is a vector in the same direction as u.

Grids in a linear subspace

It is only slightly harder to generate a grid of points in two or higher dimensions. In the DATA step, you can use nested DO loops to generate a grid of points in SAS. In the SAS/IML language, you can use the EXPANDGRID function.

In fact, with the help of linear algebra, the EXPANDGRID function enables you to construct a grid in any linear subspace. Recall that a linear subspace is the "span," or set of all linear combinations, of k basis vectors.

From a linear algebra perspective, a rectangular grid is a set of discrete, evenly spaced, linear combinations of the standard Cartesian basis vectors e1, e2, ... ek. In a linear algebra course you learn that a matrix multiplication enables you to change to another set of basis vectors.

To give a concrete example, consider a rectangular grid of points in the square [0,1]x[0,1]. In the following SAS/IML program, the rows of the matrix G contain the grid points. You can think of each row as being a set of coefficients for a linear combination of the basis vectors e1 and e2. You can use those same coefficients to form linear combinations of any other basis vectors. The program shows how to form a grid in the two-dimensional linear subspace that is spanned by the vectors u = { 3 2 1} and v = {-3 5 2}:

h = 1/N;
/* generate linear combinations of basis vectors e1 and e2 */
G = ExpandGrid( do(0,1,h),  do(0,1,h) );  
 
u = { 3 2 1};
v = {-3 5 2};
M = u // v;      /* create matrix of new basis vectors */
grid = G * M;    /* grid = linear combinations of basis vectors u and v */
span2d

In the graph, the three-dimensional vectors u and v are indicated. The grid of points is a set of linear combinations of the form (i/N) u + (j/N) v, where 0 ≤ i, j ≤ N. The grid lies on a two-dimensional subspace spanned by u and v. The points in the grid are colored according to the values of the third component: blue points (lower left corner) correspond to low z-values whereas red points (upper right corner) correspond to high z-values.

Grids are important for visualizing multivariate functions. The function in this example is linear, but grids also enable you to visualize and understand nonlinear transformations.

Post a Comment

Compute the square root matrix

Children in primary school learn that every positive number has a real square root. The number x is a square root of s, if x2 = s.

Did you know that matrices can also have square roots? For certain matrices S, you can find another matrix X such that X*X = S. To give a very simple example, if S = a*I is a multiple of the identity matrix and a > 0, then X = ±sqrt(a)*I is a square root matrix.

A matrix square root

I'm going to restrict this article to real numbers, so from now on when I say "a number" I mean a real number and when I say "a matrix" I mean a real matrix. Negative numbers do not have square roots, so it is not surprising that not every matrix has a square root. For example, the negative identity matrix (-I) does not have a square root matrix.

All positive numbers have square roots, and mathematicians, who love to generalize everything, have defined a class of matrices with properties that are reminiscent of positive numbers. They are called positive definite matrices, and they arise often in statistics because every covariance and correlation matrix is symmetric and positive definite (SPD).

It turns out that if S is a symmetric positive definite matrix, then there exists a unique SPD matrix X, (called the square root matrix) such that X2 = A. For a proof, see Golub and van Loan, 3rd edition, 1996, p. 149. Furthermore, the following iterative algorithm converges quadratically to the square root matrix (ibid, p. 571):

  1. Let X0=I be the first approximation.
  2. Apply the formula X1 = (X0 + S*X0-1) / 2 where X-1 is the inverse matrix of X. The matrix X1 is a better approximation to sqrt(S) than X0.
  3. Apply the formula Xn+1 = (Xn + S*Xn-1) / 2 iteratively until the process converges. Convergence is achieved when a matrix norm ∥ Xn+1 – Xn ∥ is as small as you like.

The astute reader will recognize this algorithm as the matrix version of the Babylonian method for computing a square root. As I explained last week, this iterative method implements Newton's method to find the roots of the (matrix) function f(X) = X*X - S.

Compute a matrix square root in SAS

In SAS, the SAS/IML matrix language is used to carry out matrix computations. To illustrate the square root algorithm, let S be the 7x7 Toeplitz matrix that is generated by the vector {4 3 2 1 0 -1 -2}. I have previously shown that this Toeplitz matrix (and others of this general form) are SPD. The following SAS/IML program implements the iterative procedure for computing the square root of an SPD matrix:

proc iml;
/* Given an SPD matrix S, this function to compute the square root matrix
   X such that X*X = S */
start sqrtm(S, maxIters=100, epsilon=1e-6);
   X = I(nrow(S));             /* initial starting matrix */
   do iter = 1 to maxIters while( norm(X*X - S) > epsilon );
      X = 0.5*(X + S*inv(X));  /* Newton's method converges to square root of S */
   end;
   if norm(X*X - S) <= epsilon then  return X;
   else  return(.);
finish;
 
S = toeplitz(4:-2);            /* 7x7 SPD example */
X = sqrtm(S);                  /* square root matrix */
print X[L="sqrtm(S)" format=7.4];
sqrtMatrix

The output shows the square root matrix. If you multiply this matrix with itself, you get the original Toeplitz matrix. Notice that the original matrix and the square root matrix can contain negative elements, which shows that "positive definite" is different from "has all positive entries."

Functions of matrices

The square root algorithm can be thought of as a mapping that takes an SPD matrix and produces the square root matrix. Therefore it is an example of a function of a matrix. Nick Higham (2008) has written a book Functions of Matrices: Theory and Computation, which contains many other functions of matrices (exp(), log(), cos(), sin(),....) and how to compute the functions accurately. SAS/IML supports the EXPMATTRIX function, which computes the exponential function of a matrix.

The square root algorithm in this post is a simplified version of a more robust algorithm that has better numerical properties. Higham (1986), "Newton's Method for the Matrix Square Root" cautions that Newton's iteration can be numerically unstable for certain matrices. (For details, see p. 543, Eqns 3.11 and 3.12.) Higham suggests an alternate (but similar) routine (p. 544) that is only slightly more expensive but has improved stability properties.

I think it is very cool that the ancient Babylonian algorithm for computing square roots of numbers can be generalized to compute the square root of a matrix. However, notice that there is an interesting difference. In the Babylonian algorithm, you are permitted to choose any positive number to begin the square root algorithm. For matrices, the initial guess must commute with the matrix S (Higham, 1986, p. 540). Therefore a multiple of the identity matrix is the safest choice for an initial guess.

Post a Comment

How to fit a variety of logistic regression models in SAS

SAS software can fit many different kinds of regression models. In fact a common question on the SAS Support Communities is "how do I fit a <name> regression model in SAS?" And within that category, the most frequent questions involve how to fit various logistic regression models in SAS.

There are many types of logistic regression models:

  • The traditional logistic model has a binary (or binomial) response variable.
  • Well-known extensions of the logistic model include ordinal regression (for an ordinal response) and multinomial regression (for a discrete but unordered response).
  • Special models handle situations such as repeated measures (longitudinal data) or random effects.
  • Other models include conditional logistic regression, survey logistic regression, Bayesian logistic regression, and fractional logistic regression (which gets the Coolest Name Award).

If you know the procedure that you want to use, you can read the procedure documentation and follow the examples to perform the analyses. In practice, however, you might know the reverse information: You know the analysis that you want to perform, but you do not know which SAS procedure implements that regression model.

I was therefore pleased to discover a short SAS Knowledge Base article titled "Types of logistic (or logit) models that can be fit using SAS." This article provides a "reverse look-up": Given the type of logistic regression model that you want to fit, it provides the name of the SAS procedures that support that analysis!

Use this resource the next time you need to know which SAS procedure can conduct a certain variant of logistic regression. Bookmark it, print it out, tattoo it on your forearm, or come to my blog and type "HOW TO FIT A LOGISTIC MODEL" into the search box. But don't forget it! This resource is a treasure for the SAS statistical programmer.

Post a Comment

All I really need to know about Newton's method I learned in primary school

I was eleven years old when I first saw Newton's method.

No, I didn't go to a school for geniuses. I didn't even know it was Newton's method until decades later. However, in sixth grade I learned an iterative algorithm that taught me (almost) everything I need to know about Newton's method for finding the roots of functions.

Newton's method and an iterative square root algorithm

The algorithm I learned in the sixth grade is an iterative procedure for computing square roots by hand. It seems like magic because it estimates a square root from an arbitrary guess. Given a positive number S, you can guess any positive number x0 and then apply the "magic formula" xn+1 = (xn + S / xn)/2 until the iterates converge to the square root of S. For details, see my article about the Babylonian algorithm.

It turns out that this Babylonian square-root algorithm is a special case of Newton's general method for finding the roots of functions. To see this, define the function f(x) = x2 - S. Notice that the square root of S is the positive root of f. Newton's method says that you can find roots by forming the function Q(x) = x - f(x) / f′(x), where f′ is the derivative of f, which is 2x. With a little bit of algebra, you can show that Q(x) = (x + S / x)/2, which is exactly the "magic formula" in the Babylonian algorithm.

Therefore, the square root algorithm is exactly Newton's method applied to a special quadratic polynomial whose root is the square root of S. The Newton iteration process is visualized by the following graph (click to enlarge), which shows the iteration history of the initial guess 10 when S = 20.

Iteration of initial guess under Babylonian square-root algorithm

Everything I need to know about Newton's method...

It turns out that almost everything I need to know about Newton's method, I learned in sixth grade. The Babylonian method for square roots provides practical tips about how to use Newton's method for finding roots:

  • You need to make an initial guess.
  • If you make a good initial guess, the iteration process is shorter.
  • When you get close to a solution, the number of correct digits doubles for each iteration. This is quadratic convergence in action!
  • Avoid inputs where the function is not defined. For the Babylonian method, you can't plug in x=0. In Newton's method, you need to avoid inputs where the derivative is close to zero.
  • Different initial guesses might iterate to different roots. In the square root algorithm, a negative initial guess converges to the negative square root.

In fact, I'll go further. The Babylonian method teaches important lessons in numerical analysis:

  • Root-finding is a fundamental tool that helps you solve all kinds of numerical problems.
  • Many problems that do not have exact answers can be solved with iterative methods.
  • Iterative methods often linearize the problem and use the linear formulation as part of the algorithm.

So you see, all I really needed to know about Newton's method I learned in sixth grade!

Post a Comment