SAS/IML functions that operate on columns of a matrix

A SAS programmer asked for a list of SAS/IML functions that operate on the columns of an n x p matrix and return a 1 x p row vector of results. The functions that behave this way tend to compute univariate descriptive statistics such as the mean, median, standard deviation, and quantiles.

The following SAS/IML functions compute a statistic for each column of a matrix:

  • COUNTMISS function: counts the number of missing values (use the "col" option)
  • COUNTN function : counts the number of nonmissing values (use the "col" option)
  • COUNTUNIQUE function: counts the number of unique values (use the "col" option)
  • CV function: computes the sample coefficient of variation
  • KURTOSIS function: computes the sample kurtosis
  • MAD function: computes the univariate median absolute deviation
  • MEAN function: computes sample means
  • QNTL call: computes sample quantiles (returns multiple rows)
  • QUARTILE function: computes the five-number summary (returns five rows: min, Q1, Q2, Q3, and max))
  • SKEWNESS function: computes the sample skewness
  • STD function: computes the sample standard deviation
  • VAR function: computes the sample variance

You can also use subscript reduction operators to compute the following descriptive statistics for each column in a matrix X:

  • X[+, ] returns the column sums (similar to the SUM function)
  • X[#, ] returns the column products (similar to the PROD function)
  • X[<>, ] returns the column maxima (similar to the MAX function)
  • X[><, ] returns the column minima (similar to the MIN function)
  • X[<:>, ] returns the index of the column maxima
  • X[>:<, ] returns the index of the column minima
  • X[##, ] returns the column sum of squares (similar to the SSQ function)

Lastly, elementwise operators (+, -, #, /, ##) in SAS/IML can operate on columns.

Post a Comment

Five reasons to use ODS EXCLUDE to suppress SAS output

I previously wrote about the best way to suppress output from SAS procedures. Suppressing output is necessary in simulation and bootstrap analyses, and it is useful in other contexts as well. In my previous article, I wrote, "many programmers use ODS _ALL_ CLOSE as a way to suppress output, but I do not recommend that you use the ODS CLOSE statement for this purpose." This article expands on that statement.

I can think of five reasons why using ODS EXCLUDE ALL is preferable to using ODS _ALL_ CLOSE for suppressing output from procedures. The ODS EXCLUDE statement is

  1. Easy. Put ODS EXCLUDE ALL before a procedure call. Put ODS EXCLUDE NONE after call. Done.
  2. Portable. It works regardless of which destinations are open.
  3. Reversible. You can easily undo the operation and send future output to all open ODS destinations.
  4. Repeatable. You can use it many times within a single SAS program without introducing side effects.
  5. Safe. It does not delete, overwrite, or change existing ODS output files.

Let me be perfectly clear: The ODS CLOSE statement is an important statement. It is used to close files that display SAS output. For many destinations (such as PDF and RTF), the output does not display until the destination is closed. However, when you want to temporarily suppress output, the ODS EXCLUDE statement is easier to use than ODS CLOSE.

A detailed look at ODS EXCLUDE

So that we can discuss the advantages of each method, let's write two programs. The first uses ODS EXCLUDE to temporarily suspend SAS output:

/* PROGRAM 1: The preferred approach */
ods graphics off;                        /* no graphics today */
ods html; ods rtf;                       /* open some ODS destinations */
title "ODS EXCLUDE: First output";
proc print data=Sashelp.Cars(obs=5); var make MPG_City; run;
/* Use ODS EXCLUDE to temporarily suppress output */
ods exclude all;                        /* suspend all open destinations */
/* INSERT PROCEDURE CALL HERE. Output is suppressed */
ods exclude none;                      /* resume output to open destinations */
title "ODS EXCLUDE: Second output";
proc print data=Sashelp.Class(obs=5); var Name Age; run;
ods _all_ close;                       /* DONE with program. Display results */

This program opens the HTML and PDF destinations. Behind the scenes, SAS creates two files to accumulate the results, which on my system are called sashtml.htm and sasrtf.rtf. Neither file is affected by the ODS EXCLUDE statements. When the program ends, the ODS CLOSE statements write a postamble to the files, closes them, and might even launch a viewer to display the output. At the end of the program, each file contains the results of both PROC PRINT calls.

The calls are ODS EXCLUDE calls are certainly easy. I do not need to know which destinations are open, which makes the technique portable. The call to ODS EXCLUDE NONE reverses the effect of ODS EXCLUDE ALL. I can include this ODS EXCLUDE "sandwich" many times in the same program, which makes the technique repeatable. Lastly, at the end of the program when I want to see the final output, there is one file for each destination, and both files are correct.

I like this technique because it requires minimal effort to get exactly the behavior I want. If you prefer inclusion to exclusion, feel free to use ODS SELECT NONE followed by ODS SELECT ALL. It is equivalent.

A detailed look at ODS CLOSE (and "OPEN")

The second program uses ODS CLOSE to close all ODS destinations. It then calls a SAS procedure (not shown). Before the program can generate more ODS output, it must reopen some ODS destinations, which is where the trouble begins:

/* PROGRAM 2: Not recommended */
ods graphics off;                        /* no graphics today */
ods html; ods rtf;                       /* open some ODS destinations */
title "ODS CLOSE: First output";
proc print data=Sashelp.Cars(obs=5); var make MPG_City; run;
/* Use ODS CLOSE to close all open destinations */
ods _all_ close;                         /* close all open destinations */
/* INSERT PROCEDURE CALL HERE. Output is suppressed */
ods html; ods rtf;                       /* reopen some ODS destinations */
title "ODS EXCLUDE: Second output";
proc print data=Sashelp.Class(obs=5); var Name Age; run;
ods _all_ close;                       /* DONE with program. Display results */

This program opens the HTML and PDF destinations. Behind the scenes, SAS creates two files to accumulate the results. The first ODS CLOSE statement write a postamble to the files, closes them, and might even launch a viewer to display the output.

When you want to resume output, you need to open some ODS destinations. All information about the previously open destinations is gone, so I hope that you wrote down which destinations were open before you closed them! This program reopens the HTML and PDF destinations, and SAS creates two files, which on my system are called sashtml1.htm and sasrtf.rtf. Notice that SAS has chosen a new name for the html file, but the same name for the RTF file. That means that the first RTF file has just been overwritten!

The remainder of the program is uneventful. The newly opened HTML and RTF files contain the output for the second PROC PRINT statement. They are properly closed (and potentially displayed) at the end of the program.

The ODS CLOSE technique as I've presented it has a few problems. Although you can CLOSE all destinations without knowing which are open, you must have knowledge of which destinations were previously open before you can reopen them. Also, the technique creates side effects. Although some destinations (like LISTING) can be opened and closed with impunity without creating additional physical files on disk, others must close and reopen a file. Obviously, if you repeat this process multiple times, the files proliferate.

Notice also that this straw-man program overwrites the RTF file that contained the output from the first portion of the program. You can use the FILE= option to specify a file name, which will ensure that all of your results are preserved. However, my point is that this second technique requires more typing, more knowledge about the state of ODS, more knowledge about how various destinations create files, and so forth.

What about the LISTING destination?

In the old days, when almost everyone used the LISTING destination, ODS LISTING CLOSE was a common technique for suppressing output. You see it in many old SAS Global Forum (née SUGI) papers that were written by excellent programmers. I've used it myself. The technique worked well because LISTING was the default destination in SAS.

However, it is time to retire the ODS LISTING CLOSE statement from your programming toolbox when your intention is to temporarily suspend output. The modern SAS programmer does not exclusively use the LISTING destination. It is not even the default destination when you run SAS in most GUI environments. If you are writing a program or macro that you intend to share with others, the program should work with whatever destinations are open at the time. The ODS EXCLUDE ALL statement is a simple approach that works for all destinations.


Can a careful programmer use the second technique (closing all destinations) successfully? Sure. It requires some effort, but the problems can be overcome.

However, the ODS EXCLUDE technique is simpler, safer, and can easily be repeated many times within a single program. Use it when you want to temporarily suspend output, and you intend to resume sending output later in the program. For even more control over suspending SAS output, use the %ODSOff and %ODSOn macros.

Post a Comment

What is the best way to suppress ODS output in SAS?

SAS procedures can produce a lot of output, but you don't always want to see it all. In simulation and bootstrap studies, you might analyze 10,000 samples or resamples. Usually you are not interested in seeing the results of each analysis displayed on your computer screen. Instead, you want to write the results to an output data set so that they can be further analyzed. Thus it is useful to know how to suppress the ODS tables (and graphics!) that ordinarily appear on the screen (and other ODS destinations).

I have previously written about the importance of turning off ODS when running simulations. There are three ways to suppress ODS output in a SAS procedure: the NOPRINT option, the ODS EXCLUDE statement, and the ODS CLOSE statement. This article compares the various ways in terms of efficiency, ease of use, and portability. Some of this material is taken from Chapter 6 (p. 97-100) of Simulating Data with SAS (Wicklin, 2013).

The NOPRINT option

About 50 procedures in SAS/STAT support a NOPRINT option in the PROC statement. Additional procedures in Base SAS, SAS/ETS, and other products also support this option. The NOPRINT option is useful when the procedure supports an OUTPUT statement, an OUT= option, an OUTEST= option, or some other syntax for producing an output data set that contains statistics that you want.

The NOPRINT option is the most efficient way to suppress output because it tells SAS procedures (through the grammar) not to produce any tables or graphs. That means that the procedure can optimize—or even skip!—certain computations. The NOPRINT option tells the procedure that the only statistics needed are those that are sent to the output data set. In some cases the procedure can run faster because it can skip certain computations.

The following example uses the NOPRINT option to suppress output to any ODS destination. The OUTP= option creates an output data set that contains the correlation coefficients between all numerical variables in the Sashelp.Cars data for each of the 38 values of the categorical variable Make:

proc corr outp=OutCorr NOPRINT;
by make;  /* no VAR stmt ==> use all numeric variables */

For this example, the output data set does not contain any p-values for the null hypothesis that the correlation coefficient equals zero. Therefore the procedure can skip the computation of those p-values. In contrast, p-values are computed and shown by default when PROC CORR produces its usual "PearsonCorr" table, which includes the Pearson correlation coefficients.

The ODS EXCLUDE statement

Unfortunately, not all procedures support the NOPRINT option. Furthermore, data sets that are created by using the OUTPUT statement or an OUT= option do not contain all statistics that can be produced by a procedure. If the numbers that you need are only available in a table, then you cannot use the NOPRINT option.

Instead, you should use ODS to suppress output to all open destinations while also using ODS to create an output data set from the table that contains the numbers that you need. You can use the ODS EXCLUDE statement to suppress output. Then you can use the ODS OUTPUT statement to specify the name of the ODS table (or tables) that you want to capture and the name of the data set that will contain the values in the table. Notice that ODS EXCLUDE does not close a destination, it merely prevents the destination from receiving output objects.

Is ODS EXCLUDE equivalent to the NOPRINT option? No. The procedure has no idea whether any ODS destinations are open. It computes all of the tables and graphics for the analysis, and then hands them off to ODS. ODS manages the output, which in this case means "throwing away" all the tables and graphs that were excluded, and only sending the non-excluded tables to their destination.

I like to use the %ODSOff and %ODSOn macros (and sometimes OPTIONS NONOTES) to suppress the output. However, to focus the discussion on the ODS EXCLUDE statement, the following statements explicitly show the ODS statements that are used to control the output. The following call to the TTEST procedure performs 38 t tests, one for each value of the Make variable. Each test is a one-sided t test of the null hypothesis that the mean value of the MPG_City variable is greater than 20.

Prior to calling the TTEST procedure, the ODS GRAPHICS OFF statement turns off the creation of graphics for subsequent procedure calls. The ODS EXCLUDE ALL statement suppresses all output to all ODS destinations. When the TTEST procedure runs, it creates all its usual tables, but ODS throws away most of them. The exception is the "TTests" table, which ODS sends to the OUTPUT destination. The result is an output data set named OutTTest. After the procedure runs, ODS EXCLUDE NONE tells SAS that future ODS objects should be sent to all open destinations.

ods graphics off;            /* or use the %ODSOff macro */
ods exclude all;             /* suspend all open destinations */
/* perform one-sided t test H0: mean(MPG_City) > 20 */
proc ttest data=Sashelp.Cars h0=20 sides=u; 
by make;
var MPG_City;
ods output TTests=OutTTest;  /* open the OUTPUT destination; select one table */
ods exclude none;            /* or use the %ODSOn macro */
proc print data=OutTTest;
where probt ^=. & probt < 0.05;

The PROC PRINT statement displays the vehicle brands for which the mean fuel efficiency is significantly greater than 20 miles per gallon at the 0.05 significance level. The brands in the list are known for producing small fuel-efficient vehicles.

An equivalent syntax is to use ODS SELECT instead of ODS EXCLUDE. To temporarily suppress ODS output, you create an "ODS sandwich": use ODS SELECT NONE before calling a procedure, and use ODS SELECT ALL when you want to resume sending output to the suspended destinations.

The ODS CLOSE statement

The ODS CLOSE statement is used to close files (HTML, PDF, RTF,...) that contain tabular and graphical output from SAS procedures. I see many programmers use ODS _ALL_ CLOSE as a way to suppress output, but I do not recommend that you use the ODS CLOSE statement for this purpose. This post is already a little long, so I will postpone the discussion of ODS CLOSE for a future article.

Can you combine NOPRINT and ODS OUTPUT?

SAS programmers crave efficiency. Upon reading that the NOPRINT option can make a procedure run faster, the ambitious programmer might attempt to run a procedure with the NOPRINT option but use the ODS OUTPUT statement to capture the results of one table. Sorry, friend, but you can't do that. The NOPRINT option means that no ODS tables are created, so there is no way to select a table and save it to a data set. For example, the following call to PROC REG uses the NOPRINT option. The ODS OUTPUT option tries to create an output data set from the "ParemeterEstimates" table, but it results in the following WARNING message:

proc reg data=sashelp.class NOPRINT;
model weight = height age;
ods output ParameterEstimates=PE;     /* WRONG: no ODS table created */
WARNING: Output 'ParameterEstimates' was not created.  Make sure that the output object
         name, label, or path is spelled correctly.  Also, verify that the appropriate
         procedure options are used to produce the requested output object.  For example,
         verify that the NOPRINT option is not used.

The WARNING message is very instructive. It gives several possible reasons why the output data set was not created. The use of the NOPRINT option is one of those reasons.


In summary:

  • If a procedure supports the NOPRINT option AND can create an output data set that contains the statistic that you need, use this combination of options. Check the procedure documentation to see what statistics are supported by the OUTPUT statement, the OUT= option, the OUTEST= option, and so forth.
  • If the statistic that you need is available only in an ODS table, or if the procedure does not support the NOPRINT option, use the ODS EXCLUDE ALL statement to suppress output from all open destinations. Use the ODS OUTPUT statement to specify tables that you want to save to SAS data sets. In this situation, I also recommend that you specify ODS GRAPHICS OFF or use the %ODSOff and %ODSOn macros.
  • Do not use ODS _ALL_ CLOSE to close all destinations. Although this does prevent future ODS output from appearing, the ODS EXCLUDE technique is more portable and easier to implement, as I will discuss in a future article.
Post a Comment

Direct me to the nuclear Bessels!

When I was an undergraduate physic major, my favorite professor would start each class with a joke or pun. One day he began class with a paraphrase of a famous quote from the movie Star Trek 4: The Voyage Home (the one with the whales). "Today," my professor said, imitating Ensign Chekov's Russian accent, "I vill direct you to the nuclear Bessels!" (See the video at the end of this article.)

I've known for a long time that SAS software supports Bessel functions, which are used in physics to solve differential equations, but it was only recently that I started seeing examples of statistical distributions that involved Bessel functions as part of the formulas. (Do an internet search for "Bessel function" and "Bayesian inference.") Accordingly, I thought I'd mention that SAS has functions to compute Bessel functions. The JBESSEL function enables you to compute Bessel functions of the first kind. The IBESSEL function enables you to compute modified Bessel functions of the first kind. There are no built-in functions for computing the (less common) Bessel functions or modified Bessel function of the second or third kind.

The SAS/IML language also includes the JROOT function, which returns the roots of Bessel functions.

The family of Bessel functions includes a parameter, called the order of the Bessel function. The most important Bessel functions have orders that are small positive integers, which appear as subscripts. The following DATA step computes the values of J0, J1, and J2, which are Bessel functions the first kind of order 0, 1, and 2 (respectively) on the domain [0, 15]:

data Bessel;
do alpha = 0 to 2;
   name = cat("J", put(alpha, 1.));
   do x = 0.0 to 15 by 0.05;
      J = jbessel(alpha, x);
title "Bessel Functions of the First Kind";
title2 "alpha = 0, 1, 2";
proc sgplot data=Bessel;
refline 0 / axis=y;
series x=x y=J / group=Name curvelabel 
       curvelabelloc=outside curvelabelpos=auto;
xaxis grid;
yaxis grid label="y";

You can see that the Bessel functions of the first kind look like damped sinusoidal functions, but the roots are not evenly spaced.

The modified Bessel functions of the first kind (or order α) are represented by the symbol Iα(x). The following DATA step evaluates Iα for α = 0, 1, 2, 3, on the domain [0, 4]:

data ModBessel;
do alpha = 0 to 3;
   name = cat("I", put(alpha, 1.));
   do x = 0.0 to 4 by 0.05;
      I = ibessel(alpha, x, 0);
title "Modified Bessel Functions of the First Kind";
title2 "alpha = 0, 1, 2, 3";
proc sgplot data=ModBessel;
series x=x y=I / group=Name curvelabel 
       curvelabelloc=outside curvelabelpos=auto;
xaxis grid;
yaxis grid max=3.5 label="y";

The modified Bessel functions grow exponentially. In some applications, the modified Bessel functions appear as part of the expression exp(x)Iα(x). In these applications, you can call the IBESSEL function as ibessel(alpha, x, 1), where the last argument specifies that the modified Bessel function should be multiplied by a decaying exponential function.

By the way, the German mathematician Friedrich Bessel (for whom Bessel functions are named) not only contributed to differential equations, but to astronomy and statistics. In astronomy, he was the first to use parallax to find the distance to celestial objects. In statistics, we remember Bessel every time that we compute a standard deviation. Bessel introduced the now-standard method of using n-1 in the denominator of the estimate of the sample variance. This is called "Bessel's correction," and it adjusts the estimate due to the fact that the sample mean is used as part of the computation of the sample variance.

If you've somehow missed seeing Star Trek 4: The Voyage Home, here is the clip where Chekov is trying to find the naval base "where they store the nuclear 'wessels'." The movie came out in 1986, which was before the fall of the Soviet Union.

Post a Comment

Improving the NC vehicle inspection pie chart

ncinspection North Carolina is a state that requires yearly inspections of motor vehicles. An inspection checks for safety features (lights, brakes, tires,....) as well as checking vehicle emissions to ensure that vehicles meet air pollution standards. I recently had a car inspected and noticed a pie chart on the inspection's summary sheet. A scanned version of the pie chart is shown at the left. Three aspects of the pie chart bothered me. First, there is no title or legend that tells you what is being plotted. Second, some of the slivers in the pie chart are very thin and hard to see. Third, I and other members of the #OneLessPie club are always looking for ways to visualize data without using pie charts. At first I thought the chart had something to do with my vehicle's emissions. Maybe different kinds of exhaust gasses? However, I found the same values in a table elsewhere on the form and learned that the values represent dollars. The cost of the inspection is $30, and these values represent the amounts that are given to various agencies, as shown in the following SAS data set:
data NCInspection;
input Cost Agency $24.;
format cost dollar6.2;
23.75 Inspection Station
 3.00 Emissions Program
 1.75 Telecommunication
 0.65 Division of Air Quality
 0.55 Highway Fund
 0.18 Volunteer Rescue/EMS
 0.12 Rescue Squad Relief
The Rescue Squad Relief money is "to financially assist any rescue or EMS worker" who has been injured or killed on the job. The Volunteer Rescue/EMS fund is to provide equipment for volunteer firefighters and rescue squads. After seeing the information in a table, I wondered why additional ink was wasted creating a pie chart. There are only seven numbers. It seems like a table is an adequate way to show how the $30 is distributed. However, it turns out that the NC General Assembly has mandated that the information must appear as a table and "shall be shown graphically in the form of a pie chart on the inspection receipt." Wow! A pie chart mandated by a state legislature! Maybe my #OneLessPie friends should start a letter-writing campaign! Although the pie chart is mandated, it might be fun to examine alternative visualizations. Here are a few ideas:
  • A cascade chart is a great way to show the contributions of each component to the full amount. See my previous article about how to create a cascade chart in SAS.
  • A bar chart is a standard way to visualize data in a pie chart.
  • A stacked bar chart is an alternative that takes up less space, but labeling those small slivers will be a challenge.
Let's see what each alternative looks like in SAS. I'll create the examples in color, although any graph on the NC inspection statement would have to be printed in black and white. The following statements use the SGPLOT procedure to create a cascade chart:
title "NC Inspection Amounts";
title2 "Cascade Chart (Waterfall Chart)";
proc sgplot data=NCInspection;
  waterfall category=Agency response=Cost / datalabel finalbartickvalue="Total Cost";
  xaxis display=(nolabel);
  yaxis grid display=(nolabel) offsetmin=0;
ncinspection1 I like this chart. (Click to enlarge.) It shows relative sizes for each categories, as well as the exact values. However, because the cascade chart is not used very much in the media, I think that it is not the best choice for a general audience. A second possibility is a classic bar chart, which is familiar to everyone. However, because the scale of the largest category ($23.75) is almost eight time larger than the next largest category ($3.00), the small bars will be hard to see. One variation on the bar chart is to use a broken axis to make the smaller values more visible. In SAS 9.4m2 you can use the RANGES= option on the XAXIS statement to create a broken axis, as follows:
title "Bar Chart with Broken Axis";
proc sgplot data=NCInspection;
  hbar Agency / response=Cost categoryorder=respdesc datalabel ;
  xaxis grid offsetmin=0 ranges=(0-3.8 21.2-24) values=(0 to 24) display=(nolabel);
  yaxis display=(nolabel);
ncinspection2 In general, I am not a fan of using a broken axis because it changes the area of the "broken" bars, but it isn't a terrible choice for these data. Nevertheless, the "broken axis" might confuse some people, so a standard bar chart (not shown) is probably a better choice. (For a technical audience you could consider using a logarithmic scale, but obviously that is not an option when the graph is intended for the general public.) A third alternative is a stacked bar chart. This chart type is very familiar, but those small bars will be hard to label. After trying several variations, I finally settled on using the following options for creating a stacked bar chart:
  • The stacked bar chart requires two categorical variables: one for the X axis and one for the stacking groups. Thus you need to create a fake category if you want only a single stacked bar chart.
  • Although the new SEGLABEL option in the VBAR and VBARPARM statements in SAS 9.4m2 supports placing labels inside bars, the small bars are too thin to use this feature. Instead, you can use the VBARPARM statement to create the stacked bars and the YAXISTABLE statement to display labels for the bars. You need to use VBARPARM instead of VBAR because only certain plot types can be combined with others by using the SGPLOT procedure.
  • The heights of the labels have to be customized for these data. For many data sets, you can set the heights at the midpoints of the bars, but you need to adjust the heights for these data so that the text does not overlap.
data Inspect / view=Inspect;
length Label $32.;
set NCInspection;
FakeGroup = 1;                /* fake group for category */
Label = trim(Agency) || " = " || putn(Cost, "Dollar6.2"); /* concatenate */
cumCost + Cost;
if _N_ <= 3 then                         /* customize placement of labels */
   LabelHeight = cumCost - Cost/2;       /* midpoint */
else if _N_ <= 5 then 
   LabelHeight = cumCost - Cost + 0.15;  /* bottom + epsilon*/
else if _N_ = 6 then 
   LabelHeight = cumCost - Cost;         /* bottom */
else LabelHeight = cumCost + 0.12;       /* top */   
ods graphics / width=350px height=800px;
title2 "Stacked Bar Chart";
proc sgplot data=Inspect noautolegend;
  vbarparm category=FakeGroup response=Cost / barwidth=1
       group=Agency grouporder=data groupdisplay=stack;
  yaxistable Label / Y=LabelHeight location=inside nolabel;
  yaxis offsetmin=0 values=(0 to 30 by 5);
  xaxis offsetmin=0.05 offsetmax=0.05 display=(nolabel noticks novalues);
ncinspection3 If the pie chart were not mandated by law, which graph would you like to see on every vehicle inspection form in North Carolina? Do you think that you can create a better visualization? Leave a comment.
Post a Comment

No major hurricanes have hit the US coast recently. Lucky us!

Perhaps you saw the headlines earlier this week about the fact that it has been nine years since the last major hurricane (category 3, 4, or 5) hit the US coast. According to a post on the GeoSpace blog, which is published by the American Geophysical Union (AGU), researchers ran a computer model that "simulated the years 1950 through 2012 a thousand times to learn how often... virtual hurricanes managed to cross into 19 U.S. states ranging from Texas to Maine." Based on the results of their simulation study, the researchers concluded that it is highly unusual to have a nine-year period without major hurricanes: "Their analysis shows that the mean wait time before a nine-year hurricane drought occurred is 177 years." From that statement, I infer that the probability is 1/177 = 0.56% of having a nine-year "drought" in which major hurricanes do not hit the US.

Undoubtedly the simulation was very complex, since it incorporated sea temperatures, El Niño phenomena, and other climatic conditions. Is there an easy way to check whether their computation is reasonable? Yes! You can use historical data about major hurricanes to build a simple statistical model.

The National Hurricane Center keeps track of how many hurricanes of each Saffir-Simpson category hit the US mainland. The following SAS DATA step presents the data for each decade from 1851–2000:

data MajorUS;
input decade $10. Count;
1851-1860  6 
1861-1870  1 
1871-1880  7 
1881-1890  5 
1891-1900  8 
1901-1910  4 
1911-1920  7 
1921-1930  5 
1931-1940  8 
1941-1950 10 
1951-1960  8 
1961-1970  6 
1971-1980  4 
1981-1990  5 
1991-2000  5 
proc means data=MajorUS N sum mean var;
var Count;

The call to PROC MEANS shows that 89 storms hit the coastal US in those 15 decades. That produces an average rate of λ = 5.93 major storms per decade. The very simplest statistical model of events that occur over time is a homogeneous Poisson process, which assumes that a major storm hits the US at a constant rate of 5.93 times per decade. Obviously that model is at best an approximation, since in reality El Niño and other factors affect the rate.

Nevertheless, this simple statistical model produces a result that is consistent with the more sophisticated computer model. In the Poisson model, we expect 0.9*5.93 = 5.337 major storms to hit land every nine years. Then the probability of observing zero storms in a nine-year period is the Poisson density evaluated at 0:

/* During a 9 year period, we expect 0.9*5.9333333 major hurricanes. */
/* What is probability of 0 major hurricanes in a nine-year period? */
data _null_;
   lambda = 0.9*5.9333333;
   p = pdf("Poisson", 0, lambda);  /* P(X=0) */
   put p;

According to the model, a nine-year "drought" of major hurricanes occurs less than 0.5% of the time. That means that it will occur about once every 200 years. Recall that the model used by the climate scientists estimated 0.56% and 177 years.

How rare is this "drought" of major hurricanes at landfall? It is more unusual than tossing a fair coin and having it land on heads seven times in a row: 0.57 = 0.78%.

Considering that you can do the Poisson calculations on a hand calculator in about two minutes, the estimates are amazingly close to the estimates from the more sophisticated simulation model. Although a simple statistical model is not a replacement for meteorological models, it is a simple way to show the rarity of the current nine-year drought.

Postscript: Someone asked me how well these data fit a Poisson distribution. You can use the GENMOD procedure to fit a Poisson distribution in SAS. The procedure computes goodness-of-fit statistics that show that a Poisson model is a reasonable fit. For example, the deviance statistic is 12.97 which, for 14 degrees of freedom, give a p-value of 0.53 for the chi-square test of deviations from the Poisson model. Therefore the chi-square test does not reject the possibility that these data are Poisson distributed.

Personally, I often prefer graphs to p-values. The following graph overlays the Poisson model on the data. The fit is okay, but not stellar.

Post a Comment

Two-stage sampling of hierarchical data in SAS

I always learn something new when I attend SAS Global Forum, and it often results from an informal conversations in The Quad or in a hallway. Last week a SAS customer described a scenario that he wanted to implement as part of an analysis of some genetic data. To oversimplify the problem, his data contained nested categorical variables. He wanted to randomly sample several values from the outer variable and then randomly sample values of the inner variable from within the selected levels of the outer variable.

I don't know much about genetics, but this kind of sampling is used in survey sampling. For example, a researcher can randomly select states and then randomly select counties within those states. Or a researcher can randomly select schools and teachers within those schools to assess classroom performance. In each case, the data are hierarchical because of the nested variables.

The customer suspected that he could use PROC SURVEYSELECT for this sampling scenario, but neither of us had experience with this sampling scheme. However, an email exchange with a SAS colleague provided the details. I saw the customer again at the Kickback Party on the last night of SAS Global Forum. While the band played in the ballroom, we found a quiet hallway and I described how to use the SURVEYSELECT procedure to select a random sample for this kind of nested data.

Two-stage sampling by using PROC SURVEYSELECT

The trick is to call the SURVEYSELECT procedure twice. I will illustrate the technique by using the SasHelp.Zipcode data, which contains counties (the inner variable) that are nested within states (the outer variable). For this example, the task will be to randomly select three states, then to randomly select two counties for each of the selected states.

The following call to PROC SURVEYSELECT randomly selects three states:

proc surveyselect data=sashelp.Zipcode noprint sampsize=3
                  seed=54321 out=ClusterSample;
   cluster State;        /* or use PPS option for nonuniform probability */
proc freq data=ClusterSample;
   tables StateName / nocum nopercent;

The SURVEYSELECT procedure creates an output data set named ClusterSample that contains 1,247 observations. Each observation corresponds to a zip code. The output from PROC FREQ shows that the selected states were Mississippi (which contains 532 zip codes), North Dakota (which contains 406 zip codes), and Vermont (which contains 309 zip codes). By using the CLUSTER statement, the procedure outputs all observations for each selected state.

This output data set becomes the input data set for a second call to PROC SURVEYSELECT. In the second call, you can use the STRATA statement to specify that you want to select from within each of the three states. You can use the CLUSTER statement yet again to extract all observations for a random sample of two counties from each state:

proc surveyselect data=ClusterSample noprint sampsize=2
                  seed=65432 out=StrataSample;
   strata State;
   cluster County;
proc freq data=StrataSample;
   tables StateName*CountyNM / list nocum nopct norow nocol;

You can see that the procedure selected Jefferson and Lowndes counties for Mississippi. There are 15 zip codes in those two counties. For North Dakota, Emmons and Ransom counties were selected, which contain 10 zip codes. For Vermont, Lamoille and Rutland counties were selected, which contain 43 zip codes.

You can use all of the usual SURVEYSELECT options to produce multiple samples (REPS= option) or to change the method for sample selection (METHOD= option). Details are provided in the documentation for the SURVEYSELECT procedure.

Two-Stage sampling in SAS/IML software

The SURVEYSELECT procedure is the best choice for implementing this hierarchical sampling scheme. However, it is an interesting programming exercise to implement this two-stage sampling in the SAS/IML language. You can use the SAMPLE function to sample without replacement, and combine that technique with the LOC-ELEMENT trick. As explained in my previous post, the following statements generate the row numbers for a random selection of three states. The row numbers are then used to form a new matrix that contains only the selected states. This matrix is analogous to the output that is produced by the first call to PROC SURVEYSELECT.

proc iml;
use sashelp.Zipcode;
read all var _NUM_  into X[colname=nVarNames];
read all var _CHAR_ into C[colname=cVarNames];
close sashelp.Zipcode;
call randseed(54321);
uState = unique(X[, "State"]);       /* state numbers in sorted order */
SampState = sample(uState, 3, "NoReplace");        /* choose 3 states at random */
ClusterIdx = loc(element(X[, "State"],SampState)); /* row numbers for states */
/* form ClusterSample = subset of selected states */
X = X[ClusterIdx,];  mattrib X colname=nVarNames;
C = C[ClusterIdx,];  mattrib C colname=cVarNames;

For the next stage of sampling, you can use a DO loop to iterate over the states. For each state, the following statements randomly select two counties. The LOC-ELEMENT trick is used to generate the row numbers, and a concatenation operator is used to accumulate the row numbers for the six selected counties:

/* for each selected state, select 2 counties */
free SelIdx;
do i = 1 to ncol(SampState);                    
   stateIdx = loc(X[, "State"] = SampState[i]);       /* get obs for i_th state */
   uCounty = unique(X[stateIdx, "County"]);           /* all counties for state */
   SampCounty = sample(uCounty, 2, "NoReplace");      /* choose 2 counties at random */        
   countyIdx = loc(element(X[, "County"],SampCounty)); /* row numbers for counties */
   SelIdx = SelIdx || xsect(stateIdx, countyIdx);
v = {"StateName" "CountyNM"};
results = trim(C[selIdx, v]) || putn(X[SelIdx,"Zip"], "Z5.");
print results[c=(v||"Zip")];

The experienced programmer will notice that concatenating vectors inside a loop is not optimally efficient. I didn't want to over-complicate this program by worrying about efficiency, so I will leave it to the motivated reader to rewrite the program without using concatenation. The REMOVE function might be useful for this task.

In summary, you can perform cluster sampling of hierarchical data by calling the SURVEYSELECT procedure twice. If you need to perform cluster sampling for data that are in a SAS/IML matrix, PROC IML provides the SAMPLE function and the LOC-ELEMENT trick, which enable you to write your own algorithm to extract random samples from nested data.

Post a Comment

Finding observations that satisfy multiple conditions: The LOC-ELEMENT technique

A common task in data analysis is to locate observations that satisfy multiple criteria. For example, you might want to locate all zip codes in certain counties within specified states.

The SAS DATA step contains the powerful WHERE statement, which enables you to extract a subset of data that satisfy certain criteria. The Sashelp.Zipcode data set contains all US zip codes and the state and county in which they are located. The following DATA step uses the IN operator to create a subset of zip codes for Seminole County (FL), Wake County (NC), Tompkins County (NY), and Hennepin County (MN):

data Subset;
set sashelp.Zipcode;
where StateCode in ("FL" "NC" "NY" "MN") & 
      CountyNM in ("Seminole" "Wake" "Tompkins" "Hennepin");
The previous code will extract all observations that satisfy the specified conditions. If "Seminole" is the name of a county in any of those four states, the observation will be included. For this set of states, each county name matches only one state.

Finding rows that match column values

The SAS/IML language also supports the WHERE clause when reading from a SAS data set, but sometimes the data are in matrices in RAM. What options exist for finding observations in matrices? In this article I'd like to introduce the LOC-ELEMENT technique.

I've previously written about how to use the LOC function in SAS/IML to locate certain elements in a matrix that satisfy a single logical expression. However, when you are searching for observations that satisfy one of several conditions, there is a nice trick that you can use: combine the LOC function with the ELEMENT function.

The ELEMENT function is one of several functions in SAS/IML that enable you to form queries about sets, such as their union, intersection, and difference. The ELEMENT function returns a 0/1 vector that indicates whether each element in a matrix is contained in a specified set of values. The following SAS/IML statements read character variables into a matrix and then use the ELEMENT and LOC functions to find the row numbers that correspond to any observations for which the StateCode variable is in the set {FL, NC, NY, MN}:

proc iml;
use sashelp.Zipcode;
read all var {"StateCode" "CountyNM"} into C[colname=varNames];
read all var {"Zip"};
close sashelp.Zipcode;
myStates = {FL NC NY MN};                             /* target set */
StateIdx = loc(element(C[, "StateCode"], myStates));  /* find matches */

The StateIdx variable contains the row numbers that correspond to all zip codes in the four specified states. To get the actual zip codes, you can form the expression zip[StateIdx].

The previous statements contain two tricks that are worth studying. The first is that loc(element(A,B)) finds the indices of the elements in the matrix A that are contained in the set B. The second is that when you use the READ INTO statement, the SAS/IML language automatically assigns matrix attributes to the matrix, so that you can use syntax such as C[, "StateCode"] to refer to the column of C that contains the state codes. For more details on this feature, see the article "Access rows or columns of a matrix by names."

In a similar way, you can find all row numbers for the four specified counties:

myCounties = {"Seminole" "Wake" "Tompkins" "Hennepin"};
CountyIdx = loc(element(C[, "CountyNM"], myCounties));

Notice that these row numbers are independent of the states. In particular, it turns out that Georgia (GA) and Oklahoma (OK) both have counties named "Seminole." You can call the XSECT function to form the intersection of these two sets of row numbers, as follows:

idx = xsect(StateIdx, CountyIdx);
print (C[idx,])[colname=varNames] (zip[idx])[colname="Zip"];

The LOC-ELEMENT technique is superior to the alternative, which is to form a long sequence of logical expressions, as follows:

code = C[, "StateCode"]; 
StateIdx2 = loc( code="FL" | code="NC" | code="NY" | code="MN" );

The logical expression isn't too hard to write for four states, but imagine wanting to extract 20 states or hundreds of counties! The LOC-ELEMENT technique is more compact and less error-prone. Try it; I think you'll like it.

Post a Comment

Matrix computations at SAS Global Forum 2015

Last week I attended SAS Global Forum 2015 in Dallas. It was packed with almost 5,000 attendees. I learned many interesting things at the conference, including the fact that you need to arrive EARLY to the statistical presentations if you want to find an empty seat!

It was gratifying to see that matrix programming with SAS/IML software is alive and well in the SAS statistical community. I was impressed with the number and quality of presentations that featured SAS/IML programs. I attended the following talks or e-posters:

Of course, SAS/IML was included in my presentation Ten Tips for Simulating Data with SAS, which was very well attended. The SAS/IML language was also used in several e-posters and presentations that unfortunately I could not attend due to other commitments. All of the papers for the proceedings are available online, so you can search the proceedings for "IML" or whatever other topic interests you.

It was energizing to listen to and talk with dozens of SAS customers. The conference demonstrated that the SAS/IML language is actively used for optimization, integration, simulation, and matrix computations, which are topics that I blog about frequently. I discussed SAS/IML or general statistical topics with dozens of customers during meals, at coffee breaks, in the hallways, and in The Quad.

At the Opening Session, I learned that the free SAS University Edition has been downloaded more than 250,000 times in the 11 months that it has been available. Because SAS/IML is part of the SAS University Edition, I expect the number of SAS/IML users to continue to grow. Experienced SAS programmers recognize that when an analysis is not available in any SAS procedure, the SAS/IML language is often the best choice for implementing the algorithm.

Post a Comment

An easy way to clear your SAS titles

Did you know that if you have set multiple titles in SAS, that there is an easy way to remove them?

For example, suppose that you've written the following statements, which call the TITLE statement to set three titles:

title "A Great Big Papa Title";
title2 "A Medium-sized Mama Title";
title3 "A Wee Little Baby Title";
proc sgplot data=Sashelp.Cars;
   vbar Origin;

Now suppose that you want to remove the values of the TITLEn statements. I have always used three statements to remove the values for the TITLE, TITLE2, and TITLE3 statements, as follows:

title;     /* cancel the TITLE  statement */
title2;    /* cancel the TITLE2 statement */
title3;    /* cancel the TITLE3 statement */

However, a SAS colleague informed me that there is an easier way. It turns out that a TITLEn statement clears titles at level n and higher. That is, you do not need to submit a blank TITLE2 and TITLE3 statements. The TITLE statement is sufficient to remove the first-level title as well as all higher-order titles, so you can write the following statement:

title;     /* cancel the TITLE, TITLE2, and TITLE3 statements */

Oh, and in case you are wondering, the same tip applies to the FOOTNOTE statement. When you cancel a FOOTNOTE statement, all higher-order footnotes are also canceled.

Do you know any cool tips for setting or clearing global SAS statements? Share a tip in the comments.

Post a Comment