Direct me to the nuclear Bessels!

When I was an undergraduate physics major, my favorite professor would start each class with a joke or pun. One day he began class with a paraphrase of a famous quote from the movie Star Trek IV: The Voyage Home (the one with the whales). "Today," my professor said, imitating Ensign Chekov's Russian accent, "I vill direct you to the nuclear Bessels!" (See the video at the end of this article.)

I've known for a long time that SAS software supports Bessel functions, which are used in physics to solve differential equations, but it was only recently that I started seeing examples of statistical distributions that involve Bessel functions as part of their formulas. (Do an internet search for "Bessel function" and "Bayesian inference.") Accordingly, I thought I'd mention that SAS has functions to compute Bessel functions. The JBESSEL function enables you to compute Bessel functions of the first kind. The IBESSEL function enables you to compute modified Bessel functions of the first kind. There are no built-in functions for computing the (less common) Bessel functions or modified Bessel functions of the second or third kind.

The SAS/IML language also includes the JROOT function, which returns the roots of Bessel functions.
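
For example, the following sketch computes the first three positive roots of J0, which are approximately 2.405, 5.520, and 8.654. (The JROOT function can also return the values of the derivative at each root; see the documentation for the full syntax.)

```sas
proc iml;
/* first three positive roots of J0(x) */
r = jroot(0, 3);
print r;
quit;
```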

The family of Bessel functions includes a parameter, called the order of the Bessel function. The most important Bessel functions have orders that are small positive integers, which appear as subscripts. The following DATA step computes the values of J0, J1, and J2, which are Bessel functions of the first kind of order 0, 1, and 2, respectively, on the domain [0, 15]:

data Bessel;
do alpha = 0 to 2;
   name = cat("J", put(alpha, 1.));
   do x = 0.0 to 15 by 0.05;
      J = jbessel(alpha, x);
      output;
   end;
end;
run;
 
title "Bessel Functions of the First Kind";
title2 "alpha = 0, 1, 2";
proc sgplot data=Bessel;
refline 0 / axis=y;
series x=x y=J / group=Name curvelabel 
       curvelabelloc=outside curvelabelpos=auto;
xaxis grid;
yaxis grid label="y";
run;
[Graph: Bessel functions of the first kind for alpha = 0, 1, 2]

You can see that the Bessel functions of the first kind look like damped sinusoidal functions, but the roots are not evenly spaced.

The modified Bessel functions of the first kind (of order α) are represented by the symbol Iα(x). The following DATA step evaluates Iα for α = 0, 1, 2, 3, on the domain [0, 4]:

data ModBessel;
do alpha = 0 to 3;
   name = cat("I", put(alpha, 1.));
   do x = 0.0 to 4 by 0.05;
      I = ibessel(alpha, x, 0);
      output;
   end;
end;
run;
 
title "Modified Bessel Functions of the First Kind";
title2 "alpha = 0, 1, 2, 3";
proc sgplot data=ModBessel;
series x=x y=I / group=Name curvelabel 
       curvelabelloc=outside curvelabelpos=auto;
xaxis grid;
yaxis grid max=3.5 label="y";
run;
[Graph: modified Bessel functions of the first kind for alpha = 0, 1, 2, 3]

The modified Bessel functions grow exponentially. In some applications, the modified Bessel functions appear as part of the expression exp(-x)Iα(x). In these applications, you can call the IBESSEL function as ibessel(alpha, x, 1), where the last argument specifies that the modified Bessel function should be multiplied by a decaying exponential function.
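
For example, the following sketch compares the two calls for x = 10; the scaled result should equal exp(-x) times the unscaled result:

```sas
data _null_;
x = 10;
I0    = ibessel(0, x, 0);     /* I0(x): large for big x             */
sI0   = ibessel(0, x, 1);     /* scaled value exp(-x)*I0(x)         */
check = exp(-x)*I0;           /* should agree with the scaled call  */
put I0= sI0= check=;
run;
```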

By the way, the German mathematician Friedrich Bessel (for whom Bessel functions are named) contributed not only to differential equations but also to astronomy and statistics. In astronomy, he was the first to use parallax to find the distance to a star. In statistics, we remember Bessel every time we compute a standard deviation. Bessel introduced the now-standard practice of using n-1 in the denominator of the estimate of the sample variance. This is called "Bessel's correction," and it compensates for the fact that the sample mean is used as part of the computation of the sample variance.
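
In symbols, Bessel's correction uses n-1 rather than n in the denominator:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

Dividing by n would systematically underestimate the variance, because the deviations are measured from the sample mean, which is itself estimated from the same data.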

If you've somehow missed seeing Star Trek IV: The Voyage Home, here is the clip in which Chekov is trying to find the naval base "where they store the nuclear 'wessels'." The movie came out in 1986, before the fall of the Soviet Union.




Improving the NC vehicle inspection pie chart

North Carolina is a state that requires yearly inspections of motor vehicles. An inspection checks for safety features (lights, brakes, tires, and so forth) and also checks vehicle emissions to ensure that vehicles meet air pollution standards. I recently had a car inspected and noticed a pie chart on the inspection's summary sheet. A scanned version of the pie chart is shown at the left.

Three aspects of the pie chart bothered me. First, there is no title or legend that tells you what is being plotted. Second, some of the slivers in the pie chart are very thin and hard to see. Third, other members of the #OneLessPie club and I are always looking for ways to visualize data without using pie charts.

At first I thought the chart had something to do with my vehicle's emissions. Maybe different kinds of exhaust gasses? However, I found the same values in a table elsewhere on the form and learned that the values represent dollars. The cost of the inspection is $30, and these values represent the amounts that are given to various agencies, as shown in the following SAS data set:
data NCInspection;
input Cost Agency $24.;
format cost dollar6.2;
datalines;
23.75 Inspection Station
 3.00 Emissions Program
 1.75 Telecommunication
 0.65 Division of Air Quality
 0.55 Highway Fund
 0.18 Volunteer Rescue/EMS
 0.12 Rescue Squad Relief
;
The Rescue Squad Relief money is "to financially assist any rescue or EMS worker" who has been injured or killed on the job. The Volunteer Rescue/EMS fund provides equipment for volunteer firefighters and rescue squads.

After seeing the information in a table, I wondered why additional ink was wasted creating a pie chart. There are only seven numbers, so a table seems like an adequate way to show how the $30 is distributed. However, it turns out that the NC General Assembly has mandated that the information must appear as a table and "shall be shown graphically in the form of a pie chart on the inspection receipt." Wow! A pie chart mandated by a state legislature! Maybe my #OneLessPie friends should start a letter-writing campaign!

Although the pie chart is mandated, it might be fun to examine alternative visualizations. Here are a few ideas:
  • A cascade chart is a great way to show the contributions of each component to the full amount. See my previous article about how to create a cascade chart in SAS.
  • A bar chart is a standard way to visualize data in a pie chart.
  • A stacked bar chart is an alternative that takes up less space, but labeling those small slivers will be a challenge.
Let's see what each alternative looks like in SAS. I'll create the examples in color, although any graph on the NC inspection statement would have to be printed in black and white. The following statements use the SGPLOT procedure to create a cascade chart:
title "NC Inspection Amounts";
title2 "Cascade Chart (Waterfall Chart)";
proc sgplot data=NCInspection;
  waterfall category=Agency response=Cost / datalabel finalbartickvalue="Total Cost";
  xaxis display=(nolabel);
  yaxis grid display=(nolabel) offsetmin=0;
run;
I like this chart. (Click to enlarge.) It shows the relative size of each category as well as the exact values. However, because the cascade chart is not used very much in the media, I think that it is not the best choice for a general audience.

A second possibility is a classic bar chart, which is familiar to everyone. However, because the largest category ($23.75) is almost eight times larger than the next largest category ($3.00), the small bars will be hard to see. One variation on the bar chart is to use a broken axis to make the smaller values more visible. In SAS 9.4m2 you can use the RANGES= option on the XAXIS statement to create a broken axis, as follows:
title "Bar Chart with Broken Axis";
proc sgplot data=NCInspection;
  hbar Agency / response=Cost categoryorder=respdesc datalabel ;
  xaxis grid offsetmin=0 ranges=(0-3.8 21.2-24) values=(0 to 24) display=(nolabel);
  yaxis display=(nolabel);
run;
In general, I am not a fan of using a broken axis because it changes the areas of the "broken" bars, but it isn't a terrible choice for these data. Nevertheless, the broken axis might confuse some people, so a standard bar chart (not shown) is probably a better choice. (For a technical audience you could consider using a logarithmic scale, but obviously that is not an option when the graph is intended for the general public.)

A third alternative is a stacked bar chart. This chart type is very familiar, but those small bars will be hard to label. After trying several variations, I finally settled on the following options for creating a stacked bar chart:
  • The stacked bar chart requires two categorical variables: one for the X axis and one for the stacking groups. Thus you need to create a fake category if you want only a single stacked bar chart.
  • Although the new SEGLABEL option in the VBAR and VBARPARM statements in SAS 9.4m2 supports placing labels inside bars, the small bars are too thin to use this feature. Instead, you can use the VBARPARM statement to create the stacked bars and the YAXISTABLE statement to display labels for the bars. You need to use VBARPARM instead of VBAR because only certain plot types can be combined with others by using the SGPLOT procedure.
  • The heights of the labels have to be customized for these data. For many data sets, you can set the heights at the midpoints of the bars, but you need to adjust the heights for these data so that the text does not overlap.
data Inspect / view=Inspect;
length Label $32.;
set NCInspection;
FakeGroup = 1;                /* fake group for category */
Label = trim(Agency) || " = " || putn(Cost, "Dollar6.2"); /* concatenate */
cumCost + Cost;
if _N_ <= 3 then                         /* customize placement of labels */
   LabelHeight = cumCost - Cost/2;       /* midpoint */
else if _N_ <= 5 then 
   LabelHeight = cumCost - Cost + 0.15;  /* bottom + epsilon*/
else if _N_ = 6 then 
   LabelHeight = cumCost - Cost;         /* bottom */
else LabelHeight = cumCost + 0.12;       /* top */   
run;
 
ods graphics / width=350px height=800px;
title2 "Stacked Bar Chart";
proc sgplot data=Inspect noautolegend;
  vbarparm category=FakeGroup response=Cost / barwidth=1
       group=Agency grouporder=data groupdisplay=stack;
  yaxistable Label / Y=LabelHeight location=inside nolabel;
  yaxis offsetmin=0 values=(0 to 30 by 5);
  xaxis offsetmin=0.05 offsetmax=0.05 display=(nolabel noticks novalues);
run;
If the pie chart were not mandated by law, which graph would you like to see on every vehicle inspection form in North Carolina? Do you think that you can create a better visualization? Leave a comment.

No major hurricanes have hit the US coast recently. Lucky us!

Perhaps you saw the headlines earlier this week about the fact that it has been nine years since the last major hurricane (category 3, 4, or 5) hit the US coast. According to a post on the GeoSpace blog, which is published by the American Geophysical Union (AGU), researchers ran a computer model that "simulated the years 1950 through 2012 a thousand times to learn how often... virtual hurricanes managed to cross into 19 U.S. states ranging from Texas to Maine." Based on the results of their simulation study, the researchers concluded that it is highly unusual to have a nine-year period without major hurricanes: "Their analysis shows that the mean wait time before a nine-year hurricane drought occurred is 177 years." From that statement, I infer that the probability of a nine-year "drought," in which no major hurricanes hit the US, is 1/177 ≈ 0.56%.

Undoubtedly the simulation was very complex, since it incorporated sea temperatures, El Niño phenomena, and other climatic conditions. Is there an easy way to check whether their computation is reasonable? Yes! You can use historical data about major hurricanes to build a simple statistical model.

The National Hurricane Center keeps track of how many hurricanes of each Saffir-Simpson category hit the US mainland. The following SAS DATA step presents the data for each decade from 1851–2000:

data MajorUS;
input decade $10. Count;
datalines;
1851-1860  6 
1861-1870  1 
1871-1880  7 
1881-1890  5 
1891-1900  8 
1901-1910  4 
1911-1920  7 
1921-1930  5 
1931-1940  8 
1941-1950 10 
1951-1960  8 
1961-1970  6 
1971-1980  4 
1981-1990  5 
1991-2000  5 
;
proc means data=MajorUS N sum mean var;
var Count;
run;
[Table: PROC MEANS output for the hurricane counts]

The call to PROC MEANS shows that 89 storms hit the coastal US in those 15 decades. That produces an average rate of λ = 5.93 major storms per decade. The very simplest statistical model of events that occur over time is a homogeneous Poisson process, which assumes that a major storm hits the US at a constant rate of 5.93 times per decade. Obviously that model is at best an approximation, since in reality El Niño and other factors affect the rate.

Nevertheless, this simple statistical model produces a result that is consistent with the more sophisticated computer model. In the Poisson model, we expect 0.9*5.93 = 5.337 major storms to hit land every nine years. Then the probability of observing zero storms in a nine-year period is the Poisson density evaluated at 0:

/* During a 9 year period, we expect 0.9*5.9333333 major hurricanes. */
/* What is probability of 0 major hurricanes in a nine-year period? */
data _null_;
   lambda = 0.9*5.9333333;
   p = pdf("Poisson", 0, lambda);  /* P(X=0) */
   put p;
run;
[Output: probability of zero major hurricanes in a nine-year period]

According to the model, a nine-year "drought" of major hurricanes occurs less than 0.5% of the time. That means that it will occur about once every 200 years. Recall that the model used by the climate scientists estimated 0.56% and 177 years.
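
The arithmetic is simple enough to do by hand, because the Poisson probability of zero events reduces to a single exponential:

```latex
P(X = 0) = e^{-\lambda}\frac{\lambda^0}{0!} = e^{-5.34} \approx 0.0048,
\qquad \frac{1}{0.0048} \approx 208 \text{ years}
```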

How rare is this "drought" of major hurricanes at landfall? It is more unusual than tossing a fair coin and having it land on heads seven times in a row: 0.5^7 ≈ 0.78%.

Considering that you can do the Poisson calculations on a hand calculator in about two minutes, the estimates are amazingly close to the estimates from the more sophisticated simulation model. Although a simple statistical model is not a replacement for meteorological models, it is a simple way to show the rarity of the current nine-year drought.

Postscript: Someone asked me how well these data fit a Poisson distribution. You can use the GENMOD procedure to fit a Poisson distribution in SAS. The procedure computes goodness-of-fit statistics that show that a Poisson model is a reasonable fit. For example, the deviance statistic is 12.97 which, for 14 degrees of freedom, gives a p-value of 0.53 for the chi-square test of deviations from the Poisson model. Therefore the chi-square test does not reject the hypothesis that these data are Poisson distributed.
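
A minimal version of that call looks something like the following (an intercept-only Poisson model for the decadal counts; PROC GENMOD prints the deviance among its goodness-of-fit statistics by default):

```sas
proc genmod data=MajorUS;
   model Count = / dist=poisson;   /* intercept-only Poisson model */
run;
```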

Personally, I often prefer graphs to p-values. The following graph overlays the Poisson model on the data. The fit is okay, but not stellar.

[Graph: Poisson model overlaid on the hurricane counts]

Two-stage sampling of hierarchical data in SAS

I always learn something new when I attend SAS Global Forum, and it often results from an informal conversation in The Quad or in a hallway. Last week a SAS customer described a scenario that he wanted to implement as part of an analysis of some genetic data. To oversimplify the problem, his data contained nested categorical variables. He wanted to randomly sample several values from the outer variable and then randomly sample values of the inner variable from within the selected levels of the outer variable.

I don't know much about genetics, but this kind of sampling is used in survey sampling. For example, a researcher can randomly select states and then randomly select counties within those states. Or a researcher can randomly select schools and teachers within those schools to assess classroom performance. In each case, the data are hierarchical because of the nested variables.

The customer suspected that he could use PROC SURVEYSELECT for this sampling scenario, but neither of us had experience with this sampling scheme. However, an email exchange with a SAS colleague provided the details. I saw the customer again at the Kickback Party on the last night of SAS Global Forum. While the band played in the ballroom, we found a quiet hallway and I described how to use the SURVEYSELECT procedure to select a random sample for this kind of nested data.

Two-stage sampling by using PROC SURVEYSELECT

The trick is to call the SURVEYSELECT procedure twice. I will illustrate the technique by using the SasHelp.Zipcode data, which contains counties (the inner variable) that are nested within states (the outer variable). For this example, the task will be to randomly select three states, then to randomly select two counties for each of the selected states.

The following call to PROC SURVEYSELECT randomly selects three states:

proc surveyselect data=sashelp.Zipcode noprint sampsize=3
                  seed=54321 out=ClusterSample;
   cluster State;        /* or use PPS option for nonuniform probability */
run;
 
proc freq data=ClusterSample;
   tables StateName / nocum nopercent;
run;
[Table: frequencies of the three selected states]

The SURVEYSELECT procedure creates an output data set named ClusterSample that contains 1,247 observations. Each observation corresponds to a zip code. The output from PROC FREQ shows that the selected states were Mississippi (which contains 532 zip codes), North Dakota (which contains 406 zip codes), and Vermont (which contains 309 zip codes). By using the CLUSTER statement, the procedure outputs all observations for each selected state.

This output data set becomes the input data set for a second call to PROC SURVEYSELECT. In the second call, you can use the STRATA statement to specify that you want to select from within each of the three states. You can use the CLUSTER statement yet again to extract all observations for a random sample of two counties from each state:

proc surveyselect data=ClusterSample noprint sampsize=2
                  seed=65432 out=StrataSample;
   strata State;
   cluster County;
run;
 
 
proc freq data=StrataSample;
   tables StateName*CountyNM / list nocum nopercent;
run;
[Table: selected counties within each selected state]

You can see that the procedure selected Jefferson and Lowndes counties for Mississippi. There are 15 zip codes in those two counties. For North Dakota, Emmons and Ransom counties were selected, which contain 10 zip codes. For Vermont, Lamoille and Rutland counties were selected, which contain 43 zip codes.

You can use all of the usual SURVEYSELECT options to produce multiple samples (REPS= option) or to change the method for sample selection (METHOD= option). Details are provided in the documentation for the SURVEYSELECT procedure.

Two-Stage sampling in SAS/IML software

The SURVEYSELECT procedure is the best choice for implementing this hierarchical sampling scheme. However, it is an interesting programming exercise to implement this two-stage sampling in the SAS/IML language. You can use the SAMPLE function to sample without replacement, and combine that technique with the LOC-ELEMENT trick. As explained in my previous post, the following statements generate the row numbers for a random selection of three states. The row numbers are then used to form a new matrix that contains only the selected states. This matrix is analogous to the output that is produced by the first call to PROC SURVEYSELECT.

proc iml;
use sashelp.Zipcode;
read all var _NUM_  into X[colname=nVarNames];
read all var _CHAR_ into C[colname=cVarNames];
close sashelp.Zipcode;
 
call randseed(54321);
uState = unique(X[, "State"]);       /* state numbers in sorted order */
SampState = sample(uState, 3, "NoReplace");        /* choose 3 states at random */
ClusterIdx = loc(element(X[, "State"],SampState)); /* row numbers for states */
 
/* form ClusterSample = subset of selected states */
X = X[ClusterIdx,];  mattrib X colname=nVarNames;
C = C[ClusterIdx,];  mattrib C colname=cVarNames;

For the next stage of sampling, you can use a DO loop to iterate over the states. For each state, the following statements randomly select two counties. The LOC-ELEMENT trick is used to generate the row numbers, and a concatenation operator is used to accumulate the row numbers for the six selected counties:

/* for each selected state, select 2 counties */
free SelIdx;
do i = 1 to ncol(SampState);                    
   stateIdx = loc(X[, "State"] = SampState[i]);       /* get obs for i_th state */
   uCounty = unique(X[stateIdx, "County"]);           /* all counties for state */
   SampCounty = sample(uCounty, 2, "NoReplace");      /* choose 2 counties at random */        
   countyIdx = loc(element(X[, "County"],SampCounty)); /* row numbers for counties */
   SelIdx = SelIdx || xsect(stateIdx, countyIdx);
end;
 
v = {"StateName" "CountyNM"};
results = trim(C[selIdx, v]) || putn(X[SelIdx,"Zip"], "Z5.");
print results[c=(v||"Zip")];

The experienced programmer will notice that concatenating vectors inside a loop is not optimally efficient. I didn't want to over-complicate this program by worrying about efficiency, so I will leave it to the motivated reader to rewrite the program without using concatenation. The REMOVE function might be useful for this task.

In summary, you can perform cluster sampling of hierarchical data by calling the SURVEYSELECT procedure twice. If you need to perform cluster sampling for data that are in a SAS/IML matrix, PROC IML provides the SAMPLE function and the LOC-ELEMENT trick, which enable you to write your own algorithm to extract random samples from nested data.


Finding observations that satisfy multiple conditions: The LOC-ELEMENT technique

A common task in data analysis is to locate observations that satisfy multiple criteria. For example, you might want to locate all zip codes in certain counties within specified states.

The SAS DATA step contains the powerful WHERE statement, which enables you to extract a subset of data that satisfy certain criteria. The Sashelp.Zipcode data set contains all US zip codes and the state and county in which they are located. The following DATA step uses the IN operator to create a subset of zip codes for Seminole County (FL), Wake County (NC), Tompkins County (NY), and Hennepin County (MN):

data Subset;
set sashelp.Zipcode;
where StateCode in ("FL" "NC" "NY" "MN") & 
      CountyNM in ("Seminole" "Wake" "Tompkins" "Hennepin");
run;
The previous code will extract all observations that satisfy the specified conditions. If "Seminole" is the name of a county in any of those four states, the observation will be included. For this set of states, each county name matches only one state.

Finding rows that match column values

The SAS/IML language also supports the WHERE clause when reading from a SAS data set, but sometimes the data are in matrices in RAM. What options exist for finding observations in matrices? In this article I'd like to introduce the LOC-ELEMENT technique.

I've previously written about how to use the LOC function in SAS/IML to locate certain elements in a matrix that satisfy a single logical expression. However, when you are searching for observations that satisfy one of several conditions, there is a nice trick that you can use: combine the LOC function with the ELEMENT function.

The ELEMENT function is one of several functions in SAS/IML that enable you to form queries about sets, such as their union, intersection, and difference. The ELEMENT function returns a 0/1 vector that indicates whether each element in a matrix is contained in a specified set of values. The following SAS/IML statements read character variables into a matrix and then use the ELEMENT and LOC functions to find the row numbers that correspond to any observations for which the StateCode variable is in the set {FL, NC, NY, MN}:

proc iml;
use sashelp.Zipcode;
read all var {"StateCode" "CountyNM"} into C[colname=varNames];
read all var {"Zip"};
close sashelp.Zipcode;
 
myStates = {FL NC NY MN};                             /* target set */
StateIdx = loc(element(C[, "StateCode"], myStates));  /* find matches */

The StateIdx variable contains the row numbers that correspond to all zip codes in the four specified states. To get the actual zip codes, you can form the expression zip[StateIdx].

The previous statements contain two tricks that are worth studying. The first is that loc(element(A,B)) finds the indices of the elements in the matrix A that are contained in the set B. The second is that when you use the READ INTO statement, the SAS/IML language automatically assigns matrix attributes to the matrix, so that you can use syntax such as C[, "StateCode"] to refer to the column of C that contains the state codes. For more details on this feature, see the article "Access rows or columns of a matrix by names."

In a similar way, you can find all row numbers for the four specified counties:

myCounties = {"Seminole" "Wake" "Tompkins" "Hennepin"};
CountyIdx = loc(element(C[, "CountyNM"], myCounties));

Notice that these row numbers are independent of the states. In particular, it turns out that Georgia (GA) and Oklahoma (OK) both have counties named "Seminole." You can call the XSECT function to form the intersection of these two sets of row numbers, as follows:

idx = xsect(StateIdx, CountyIdx);
print (C[idx,])[colname=varNames] (zip[idx])[colname="Zip"];

The LOC-ELEMENT technique is superior to the alternative, which is to form a long sequence of logical expressions, as follows:

code = C[, "StateCode"]; 
StateIdx2 = loc( code="FL" | code="NC" | code="NY" | code="MN" );

The logical expression isn't too hard to write for four states, but imagine wanting to extract 20 states or hundreds of counties! The LOC-ELEMENT technique is more compact and less error-prone. Try it; I think you'll like it.


Matrix computations at SAS Global Forum 2015

Last week I attended SAS Global Forum 2015 in Dallas. It was packed with almost 5,000 attendees. I learned many interesting things at the conference, including the fact that you need to arrive EARLY to the statistical presentations if you want to find an empty seat!

It was gratifying to see that matrix programming with SAS/IML software is alive and well in the SAS statistical community. I was impressed with the number and quality of presentations that featured SAS/IML programs, and I attended as many of those talks and e-posters as I could.

Of course, SAS/IML was included in my presentation Ten Tips for Simulating Data with SAS, which was very well attended. The SAS/IML language was also used in several e-posters and presentations that unfortunately I could not attend due to other commitments. All of the papers for the proceedings are available online, so you can search the proceedings for "IML" or whatever other topic interests you.

It was energizing to listen to and talk with dozens of SAS customers. The conference demonstrated that the SAS/IML language is actively used for optimization, integration, simulation, and matrix computations, which are topics that I blog about frequently. I discussed SAS/IML or general statistical topics with dozens of customers during meals, at coffee breaks, in the hallways, and in The Quad.

At the Opening Session, I learned that the free SAS University Edition has been downloaded more than 250,000 times in the 11 months that it has been available. Because SAS/IML is part of the SAS University Edition, I expect the number of SAS/IML users to continue to grow. Experienced SAS programmers recognize that when an analysis is not available in any SAS procedure, the SAS/IML language is often the best choice for implementing the algorithm.


An easy way to clear your SAS titles

Did you know that if you have set multiple titles in SAS, there is an easy way to remove them?

For example, suppose that you've written the following statements, which call the TITLE statement to set three titles:

title "A Great Big Papa Title";
title2 "A Medium-sized Mama Title";
title3 "A Wee Little Baby Title";
proc sgplot data=Sashelp.Cars;
   vbar Origin;
run;
[Bar chart that displays the three titles]

Now suppose that you want to remove the values of the TITLEn statements. I have always used three statements to remove the values for the TITLE, TITLE2, and TITLE3 statements, as follows:

title;     /* cancel the TITLE  statement */
title2;    /* cancel the TITLE2 statement */
title3;    /* cancel the TITLE3 statement */

However, a SAS colleague informed me that there is an easier way. It turns out that a TITLEn statement clears all titles at level n and higher. That is, you do not need to submit blank TITLE2 and TITLE3 statements. The TITLE statement is sufficient to remove the first-level title as well as all higher-order titles, so you can write the following statement:

title;     /* cancel the TITLE, TITLE2, and TITLE3 statements */

Oh, and in case you are wondering, the same tip applies to the FOOTNOTE statement. When you cancel a FOOTNOTE statement, all higher-order footnotes are also canceled.
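
For example (the footnote text here is made up purely for illustration), a single blank FOOTNOTE statement clears all the footnotes:

```sas
footnote  "Data source: Sashelp.Cars";
footnote2 "Prepared by the analyst";
proc sgplot data=Sashelp.Cars;
   vbar Origin;
run;
 
footnote;    /* cancels the FOOTNOTE and FOOTNOTE2 statements */
```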

Do you know any cool tips for setting or clearing global SAS statements? Share a tip in the comments.


Create and use a permutation matrix in SAS

Suppose that you compute the correlation matrix (call it R1) for a set of variables x1, x2, ..., x8. For some reason, you later want to compute the correlation matrix for the variables in a different order, maybe x2, x1, x7,..., x6. Do you need to go back to the data and compute the new correlation matrix (call it R2)? Or can you somehow obtain the new correlation matrix from the old one?

The answer, of course, is that you can create the new correlation matrix from the original matrix. After all, R1 contains all of the correlations between variables. The challenge is to figure out how to rearrange the numbers in R1 so that they form R2.

Permutation matrices

Because the SAS/IML language has built-in support for matrix operations, it is a good tool for converting R1 into R2. Mathematically speaking, this is a classic situation in which a permutation matrix is useful. A permutation matrix is formed by permuting the rows of the identity matrix. I have written about using permutation matrices to create "Secret Santa" lists, where each person in a group gives a present to someone else in the group.

One of the nice properties of a permutation matrix is the ease with which you can permute rows and columns in a second matrix. If P is a permutation matrix and A is any square matrix, then the matrix product P*A permutes the rows of A. Similarly, the multiplication A*P permutes the columns of A.
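
A tiny 3 x 3 example (the matrices here are made up purely for illustration) makes the distinction concrete:

```sas
proc iml;
P = {0 1 0,
     1 0 0,
     0 0 1};       /* permutation matrix that swaps rows 1 and 2 of I */
A = {1 2 3,
     4 5 6,
     7 8 9};       /* a small example matrix */
rowPerm = P*A;     /* rows of A in the order 2, 1, 3    */
colPerm = A*P;     /* columns of A in the order 2, 1, 3 */
print rowPerm, colPerm;
quit;
```

The product P*A returns the rows of A in the order 2, 1, 3, whereas A*P returns the columns of A in that order.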

The following SAS/IML module shows one way to generate a permutation matrix when the permutation is represented by a row vector:

proc iml;
/* create permutation matrix from a permutation, v, where v is a row vector */
start PermutationMatrix(v);
   return( I(ncol(v))[v, ] );       /* generate identity, then reorder rows */
finish;
 
perm = {2 1 7 3 5 8 4 6};          /* permutation of 1:8 */
P = PermutationMatrix(perm); 
print perm, P;
[Output: the permutation vector and the permutation matrix P]

The permutation matrix is a reordering of the rows of the identity matrix:

  • The first row of P is the second row of the identity matrix because perm[1]=2.
  • The second row of P is the first row of the identity matrix because perm[2]=1.
  • The third row of P is the seventh row of the identity matrix because perm[3]=7, and so forth.

Reordering a correlation matrix

Let's see how the permutation matrix enables you to reorder a correlation matrix. The following statements compute a correlation matrix for eight variables in the Sashelp.Heart data set, which records measurements about patients in the Framingham Heart Study. The variables are listed in alphabetical order:

/* list variables in alphabetical order */
varNames = {"AgeAtStart" "Cholesterol" "Diastolic" "Height" "MRW" "Smoking" "Systolic" "Weight"};
use Sashelp.Heart;
read all var varNames into X;
close;
 
R1 = corr(X);
print R1[c=varNames r=varNames format=5.2];

The correlation matrix shows the correlations for the alphabetical list of variables. Applying a permutation such as perm rearranges the variables into a different order. In the same way, you can apply the permutation matrix, P, to permute the rows and columns of R1. It turns out that the correct formula to use is R2 = P*R1*P', as shown below:

Names = varNames[perm];
R2 = P*R1*P`;
print R2[c=Names r=Names format=5.2];

You can verify that R2 is the correct correlation matrix for the reordered variables by repeating the correlation computation: corr(X[,perm]).

Why is P*R1*P' the correct formula? A permutation matrix is an example of an orthogonal matrix, so the transpose of P equals its inverse: P' = P^(-1). Consequently, the formula specifies a similarity transformation between R1 and R2 that corresponds to a change of basis. The new basis vectors are the specified permutation of the standard basis vectors.
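You can check the orthogonality argument numerically. The following NumPy sketch uses a made-up 4x4 symmetric matrix, not the Heart data:

```python
import numpy as np

v = np.array([1, 0, 3, 2])         # a permutation of {0, 1, 2, 3} (0-based)
P = np.eye(4)[v]

# a permutation matrix is orthogonal: its transpose is its inverse
assert np.allclose(P @ P.T, np.eye(4))

# so R2 = P*R1*P' is a similarity transformation of the symmetric matrix R1
R1 = np.array([[1.0, 0.5, 0.2, 0.1],
               [0.5, 1.0, 0.3, 0.4],
               [0.2, 0.3, 1.0, 0.6],
               [0.1, 0.4, 0.6, 1.0]])
R2 = P @ R1 @ P.T                  # rows and columns reordered consistently
```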

Although I have demonstrated this technique for a correlation matrix, it applies to other matrices such as a covariance matrix or a distance matrix. It can be a useful technique when you rearrange the variable order so that similar variables are clustered together.

An alternative solution

As demonstrated, you can use premultiplication by a permutation matrix to permute rows, and postmultiplication to permute columns. If you aren't comfortable with permutation matrices, there is an easy alternative way to create R2: permute the row and column subscripts of R1 directly:

B  = R1[,perm];       /* permute the columns */
R2 = B[perm,];        /* permute the rows */
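The subscript approach has a direct analog in NumPy, where np.ix_ performs both permutations in one indexing expression. This sketch uses a made-up 3x3 correlation matrix:

```python
import numpy as np

R1 = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.3],
               [0.2, 0.3, 1.0]])
perm = [2, 0, 1]                   # 0-based permutation of the indices

B  = R1[:, perm]                   # permute the columns
R2 = B[perm, :]                    # permute the rows

# np.ix_ performs both permutations in a single indexing expression
assert np.allclose(R2, R1[np.ix_(perm, perm)])
```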

So whether you use a permutation matrix or direct subscripting to permute the rows and columns of a matrix, the SAS/IML language gives you an easy way to rearrange a matrix of multivariate statistics to accommodate a change in the order of the variables.


Create a cascade chart in SAS

Sometimes different communities use the same name for different objects. To a soldier, "boots" are rugged, heavy, high-top foot coverings. To a soccer (football) player, "boots" are lightweight cleats. So it is with the term "waterfall plot." To researchers in the medical field, a "waterfall plot" is a sorted bar chart of patients in a clinical trial. I have written about how to create a waterfall plot for a clinical trial in SAS. However, to people in finance and business, a "waterfall chart" is a hanging bar chart that shows the cumulative contributions of a response across several sources. This second kind of waterfall chart is also called a cascade plot or a progressive bar chart. This article describes how to create a cascade chart in SAS.


A typical cascade chart shows a monthly budget for a hypothetical individual. The response variable is money, and the cascade chart shows positive or negative contributions from various sources. Income (colored green) includes his take-home pay from his job and some money he makes by tutoring. Fixed expenses (colored red) include his mortgage, student or car loans, various utilities, and groceries. Discretionary expenses (colored orange) include entertainment, clothes, gifts, and his daily trip to Starbucks. At the end of the month, the budget shows leftover money that is earmarked as savings.

How to create a cascade chart (waterfall chart) in SAS

The cascade chart is easy to create in SAS by using the WATERFALL statement in the SGPLOT procedure. You need to specify the categories (in the order that you want them to appear) and the contribution (positive or negative) from each category. Optionally, you can specify a categorical variable that will be used to assign colors to the bars. You do not specify data for the last category; it is computed automatically. The following DATA step creates the data. The values of the TYPE variable are used to color the bars in the cascade chart:

data Budget;
length Name $13 Type $13;
format Amount dollar.;
input Name $13. Amount Type $;
datalines;
Paycheck       5000 Income
Tutoring        500 Income
Mortgage      -2000 Fixed
Loans          -750 Fixed
Utilities      -650 Fixed
Groceries      -600 Fixed
Entertainment  -300 Discretionary
Charity        -250 Discretionary
Other          -300 Discretionary
;
 
title "Waterfall Chart or Cascade Chart";
title2 "Monthly Budget";
proc sgplot data=Budget;
  styleattrs datacolors=(ForestGreen FireBrick OrangeRed);
  waterfall category=Name response=Amount  / colorgroup=Type datalabel
            finalbartickvalue="Savings" finalbarattrs=(color=RoyalBlue);
  keylegend / location=inside down=3 opaque;
  xaxis display=(nolabel);
  yaxis grid display=(nolabel) offsetmin=0;
run;

Each statement is explained below:

  • The STYLEATTRS statement is optional. If it is omitted, then the bars are colored according to the current ODS style. For this example, I wanted green to denote positive cash flow (credits) and red/orange to denote negative flow (debits).
  • The WATERFALL statement specifies the variables for the plot. The CATEGORY= option specifies the categorical variable for the X axis, whereas the RESPONSE= option specifies the numerical variable to plot on the Y axis. The COLORGROUP= option specifies a categorical variable that is used to assign colors to the bars. The FINALBARTICKVALUE= option enables you to assign a name to the final bar, which shows the final cumulative sum.
  • The KEYLEGEND statement is optional. It is used to place the legend in a convenient location.
  • The XAXIS and YAXIS statements are optional. They enable you to specify attributes of the plot axes.

How to interpret a cascade chart

The cascade chart enables you to see how a cumulative total is decomposed into constituent parts. The cascade chart is most useful when you want to show both positive and negative values.

Notice that the bottom of the bar for the second category is level with the top of the bar for the first category. Similarly, the base for the third bar is level with the cumulative contributions from the first two categories. In general, the baseline for the kth bar is placed at the cumulative sum of the first (k-1) categories. Positive quantities result in a bar that increases from the baseline; negative quantities result in bars that point down.
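The baseline rule is just a running total. The following Python sketch (mine, not part of the SAS example) computes the baselines for the budget data used above:

```python
import numpy as np

# amounts from the Budget data set, in the order they appear
amounts = np.array([5000, 500, -2000, -750, -650, -600, -300, -250, -300])

# baseline of the k-th bar = cumulative sum of the first k-1 amounts
baselines = np.concatenate(([0], np.cumsum(amounts)[:-1]))

# the final "Savings" bar that the WATERFALL statement adds is the overall total
savings = int(amounts.sum())
```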

If all of the quantities are positive, then the cascade chart is similar to a stacked bar chart. For example, you could plot the time (response) that it takes to write a book, broken down by various stages: research, first draft, editing, revising, and so forth. The stacked bar chart shows each category as a slice of the whole, but it has two drawbacks: it is hard to label small cells, and it cannot handle negative values. The cascade chart resolves both of these issues by horizontally displacing the cells.

The categories in the cascade chart are displayed in the order that they appear in the data set. It is up to you to group them in a meaningful way. In this example, I put positive quantities (income) first. This practice gives the chart its name because the chart vaguely resembles a cascade of water down a geological outcrop.

If the categories are ordered, then the chart might have several up-and-down segments. For example, if you are plotting the profit/loss for a sequence of years, the chart might move up during prosperous times and move down during economically challenging conditions.

If you are comparing the response categories to each other, a basic bar chart is probably what you want. But if you are trying to show how a total value is the cumulative sum of contributions from several categories, then the cascade chart is a useful visualization technique.


Sum a series in SAS

A customer asked:

How do we go about summing a finite series in SAS? For example, I want to compute the series
S_n = Σ_{i=1}^{n-2} i / floor(n/i)
for various integers n ≥ 3. I want to output two columns, one for the natural numbers and one for the summation of the series.

Summations arise often in statistical computing, so this is a great question. I don't understand why the upper limit only goes to n–2, but I think you'll see that the programs below can compute sums for any upper limit.

You can compute summations in the DATA step by using a DO loop. To compute them efficiently in the SAS/IML language, you usually want to construct a vector such that the ith element of the vector is the ith term of the series. Then you use the SUM function to add the terms of the series.

A summation by using the DATA step

It is easy to sum a series by using the DATA step. You set the value of the sum to 0, then loop over the values of i, summing up each term as you go. For this example, the ith term is i / floor(n/i), and the summation is over the terms i=1 to i=n–2. The following DATA step computes this summation for a sequence of values of n:

data Series;
do n = 3 to 10;
   sum = 0;
   do i = 1 to n-2;
      sum = sum + i/ floor(n/i);
   end;
   output;
end;
keep n sum;
run;
 
proc print data=Series noobs; run;
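If you want to check the arithmetic in another language, here is a direct Python translation of the DATA step loop (the function name is mine):

```python
import math

# straight translation of the DATA step: for each n, sum i/floor(n/i) over i = 1..n-2
def sum_series(n):
    total = 0.0
    for i in range(1, n - 1):        # i = 1, 2, ..., n-2
        total += i / math.floor(n / i)
    return total

results = {n: sum_series(n) for n in range(3, 11)}   # n = 3, 4, ..., 10
```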

A vectorized summation

In a vector language such as R, MATLAB, or SAS/IML, an efficient summation begins by constructing vectors that contain the elements that are being summed. Suppose that you define i to be a vector that contains the sequence 1, 2, ..., n–2. Then the expression floor(n/i) is also a vector, as is the elementwise ratio of these vectors. These three vectors are shown in the rows of the following table:

For n = 6, the three vectors are:

  i              1       2       3       4
  floor(n/i)     6       3       2       1
  i/floor(n/i)   0.167   0.667   1.5     4

The bottom row of the table contains the terms of the series. Notice that the terms sum to 6.333, which agrees with the output from the DATA step in the previous section. Use the SUM function to add the terms, as shown in the following function, which computes the summation Sn for an arbitrary value of n > 2:

proc iml;
start SumSeries(n);
   i = 1:(n-2);                    /* index of terms */
   return( sum(i / floor(n/i)) );  /* sum of terms */
finish;
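For comparison, the same vectorized computation looks like this in NumPy (a sketch; the function name mirrors the IML module):

```python
import numpy as np

def sum_series(n):
    i = np.arange(1, n - 1)               # index of terms: 1, ..., n-2
    return np.sum(i / np.floor(n / i))    # sum of terms
```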

If you want the summation for several values of n, you can use a DO loop to iterate over values of n. The result is the same as for the DATA step program.

n = T(3:10);
sum = j(nrow(n),1);     /* allocate a vector for the results */
do k = 1 to nrow(n);
   sum[k] = SumSeries( n[k] );
end;
print n sum;

A matrix approach to summation

The previous section shows an efficient way to use vectorized operations to compute a summation. However, just for fun, let's see if we can compute the summation for MANY values of n without writing ANY loops! As is typical, a matrix approach uses more memory.

The main idea is to construct a lower triangular matrix whose nth row contains the terms for Sn. You can start by constructing the lower triangular matrix A whose rows contain the vector 1:(n–2) for various values of n. In order to avoid dividing by 0 when you form the expression floor(n/i), you should use missing values for the upper triangular elements of the matrix. The ROW and COL functions are useful for constructing the matrix, as shown below. The ROW and COL functions were introduced in SAS/IML 12.3, but if you are running an earlier version of SAS you can find the definitions in a previous blog post.

nMax = 10;
A = col( j(nMax-2, nMax-2) );      /* each row is {1 2 3 ... 8} */
A[loc(col(A)>row(A))] = .;         /* set upper triangular elements to missing */
*print A;

If you define the column vector n = {3, 4, ..., 10}, then each row of the matrix floor(n/A) contains the denominators for the series. To compute the summation for all values of n, you can form the expression A / floor(n/A) and compute the sums across each row by using the summation subscript reduction operator, as follows:

n = T(3:nMax);         /* column vector */
B = A / floor( n/A );  /* each row contains terms of series for a given n value */
sum = B[,+];           /* sum across rows */
print n sum;
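The same loop-free matrix construction can be sketched in NumPy, using NaN in place of SAS missing values (variable names follow the IML program):

```python
import numpy as np

nMax = 10
m = nMax - 2
# each row of A is {1 2 3 ... 8}; NaN plays the role of a SAS missing value
A = np.tile(np.arange(1.0, m + 1), (m, 1))
A[np.triu_indices(m, k=1)] = np.nan        # blank out the upper triangle

n = np.arange(3, nMax + 1).reshape(-1, 1)  # column vector {3, 4, ..., 10}
B = A / np.floor(n / A)                    # row k holds the terms for n = k+3
sums = np.nansum(B, axis=1)                # sum across each row, ignoring NaN
```

The NaN entries propagate through the division and are ignored by np.nansum, just as missing values are excluded by the IML subscript reduction operator.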

Although I wouldn't ordinarily use the matrix method to sum a series, the technique is useful for constructing structured matrices whose elements are given by a formula. A canonical example is the Hilbert matrix.
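For instance, a broadcasting one-liner builds the Hilbert matrix, whose (i,j) entry is 1/(i+j–1). A NumPy sketch:

```python
import numpy as np

n = 4
i = np.arange(1, n + 1).reshape(-1, 1)   # column vector of row indices
j = np.arange(1, n + 1)                  # row vector of column indices
H = 1.0 / (i + j - 1)                    # Hilbert matrix via broadcasting
```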
