The relationship between skewness and kurtosis

In my book Simulating Data with SAS, I discuss a relationship between the skewness and kurtosis of probability distributions that might not be familiar to some statistical programmers. Namely, the skewness and kurtosis of a probability distribution are not independent. If κ is the full kurtosis of a distribution and γ is the skewness, then it is a mathematical fact that κ ≥ 1 + γ2. In particular, the full kurtosis can never be less than 1. Equivalently, the excess kurtosis, which is κ – 3, can never be smaller than –2.

My book includes SAS programs that readers can use to simulate univariate and multivariate data that satisfy specific statistical relationships. Occasionally readers write to say that the programs report an error when used to simulate from a distribution that has large skewness and small kurtosis. For example, recently a reader wrote to say that "the programs give error messages when I try to simulate data with skewness=2 and kurtosis=3."

My response: Yes, the error message appears because you are asking for the impossible. The values γ=2 and κ=3 do not satisfy the relationship that κ ≥ 1 + γ2. Consequently, there is no probability distribution that has that combination of skewness and kurtosis.

The distribution with the smallest kurtosis

It is easy to show that the simple discrete Bernoulli distribution is the probability distribution that has the least kurtosis for a given amount of skewness. This makes sense intuitively because the kurtosis measures the heaviness of the tails of a distribution. The Bernoulli distribution has no tails; all of the probability mass is near the center.

For the Bernoulli distribution with probability of success p, define q = 1 – p. Then the skewness of the Bernoulli distribution is γ = (1-2p) / sqrt(pq) and the full kurtosis is κ = (1-3pq) / (pq). These values satisfy the equation κ = 1 + γ2, which shows that the kurtosis is as small as possible for a given value of the parameter p.

Consequently, if you want an example of a distribution that has small kurtosis relative to the amount of skewness, choose the Bernoulli distribution. The Bernoulli distribution with p = 0.5 has the smallest kurtosis among all probability distributions. When p = 0.5, the full kurtosis is κ = 1, which is equivalent to an excess kurtosis of –2.

It is worth mentioning that if you have a sample of data and compute the sample estimate of excess kurtosis, the estimate might be smaller than –2. For example, the excess kurtosis for the data {1, 1, 2, 2} is –6. Even in larger samples, you can expect some random samples to have a kurtosis value that is less than the kurtosis of the probability distribution. For example, the following simulation draws 100 samples of size N from the Bernoulli(0.5) distribution. When N=10, the minimum value of the excess kurtosis is –2.57, which is less than the kurtosis of the distribution. If you increase N, the minimum kurtosis value gets closer to the population value of –2.

proc iml;
N = 10;                 /* sample size */
NumSamples = 100;       /* number of samples to draw */
call randseed(12345);
x = randfun(N//NumSamples, "Bernoulli", 0.5);
kurt = kurtosis(x);
minKurt = min(kurt);
print minKurt;

For more information about skewness and kurtosis, see the articles "Skew this" and "Does this kurtosis make my tail look fat?"

Post a Comment

IF-THEN logic with matrix expressions

In the SAS DATA step, all variables are scalar quantities. Consequently, an IF-THEN/ELSE statement that evaluates a logical expression is unambiguous. For example, the following DATA step statements print "c=5 is TRUE" to the log if the variable c is equal to 5:

if c=5 then put "c=5 is TRUE";
else put "c=5 is FALSE";

In the SAS/IML language, the situation can be more complicated because a variable is usually a matrix or a vector. You can use matrices in logical expressions, but when c is not a scalar quantity, what does the statement if c=5 then... mean? What happens when some of the elements of c are 5, but other elements are not?

The following SAS/IML example illustrates the situation:

proc iml;
c = {5 5 6};
if c=5 then 
   print "c=5 is TRUE";
   print "c=5 is FALSE";

According to the output, the logical expression c=5 evaluates to FALSE. This result can be understood by reading the documentation for the IF-THEN/ELSE statement, which says, "when the condition to be evaluated is a matrix expression, ... this evaluation is equivalent to using the ALL function." In other words, the IF-THEN/ELSE statement is equivalent to the following:

if all(c=5) then 
   print "all(c=5) is TRUE";
   print "all(c=5) is FALSE";

The ALL function evaluates to TRUE when all of the elements in a matrix are nonzero and nonmissing. As I recently mentioned, you can use the RESET PRINTALL statement to see temporary matrices that SAS/IML creates when it evaluates logical expressions. The expression c=5 results in the temporary matrix {1 1 0}. The ALL function returns 0 (FALSE) for that matrix.

If you want a condition to evaluate to TRUE if any element in a matrix satisfies a criterion, then use the ANY function explicitly, as follows:

if any(c=5) then 
   print "any(c=5) is TRUE";
   print "any(c=5) is FALSE";

The takeaway lesson is that a matrix expression can be used in an IF-THEN/ELSE statement in the SAS/IML language. The THEN condition will be executed when all of the elements in the matrix satisfy the expression. The ELSE condition will be executed if one or more elements do not satisfy the expression.

Post a Comment

What is an empty matrix?

At the beginning of my book Statistical Programming with SAS/IML Software I give the following programming tip (p. 25):

Do not confuse an empty matrix with a matrix that contains missing values or with a zero matrix. An empty matrix has no rows and no columns. A matrix that contains missing values has at least one row and column, as does a matrix that contains zeros.

What do those sentences mean? Consider the following SAS/IML program:

proc iml;
z = j(2, 2, 0);       /* zero matrix = {0 0, 0 0} */
m = j(1, 4, .);       /* matrix of missing = {. . . .} */
q = {};               /* empty matrix */

The J function in SAS/IML software creates a matrix that has a specified number of rows and columns and fills the matrix with a constant value. The syntax is result = j(nrows, ncols, value). The first statement in the program creates a matrix named z that has two rows and two columns. Each element of the matrix has the value 0. The matrix is also called a "zero matrix." You can print this matrix and use it in expressions that involve matrix addition, matrix multiplication, logical expressions, and so forth.

The second statement creates a matrix named m that has one row and four columns. Each element is a missing value. The matrix is called a "matrix of missing values." You can print this matrix. You can use it in some elementwise matrix operations, but not in matrix multiplication.

The third statement creates an empty matrix. It has zero rows and zero columns. It does not have any elements. You cannot add with it, multiply with it, or even print it. However, you can ask for its dimensions, concatenate with it, and use it in set operations such as unions and intersections.

The syntax q = {} was introduced in SAS/IML 12.1. Read my previous article for ways to create and detect empty matrices.

In conclusion, do not confuse an empty matrix with a matrix of missing values. They are very different. The former has no rows or columns. The latter has rows and columns, but every element is a missing value.

Post a Comment

Finding matrix elements that satisfy a logical expression

A common task in SAS/IML programming is finding elements of a SAS/IML matrix that satisfy a logical expression. For example, you might need to know which matrix elements are missing, are negative, or are divisible by 2.

In the DATA step, you can use the WHERE clause to subset data. You can also use IF-THEN/ELSE logic to examine each observation in a data set to determine whether it satisfies a criterion. For data that are stored in a SAS/IML matrix, some programmers write a loop that iterates over the rows and/or columns and use IF-THEN/ELSE logic on each scalar element in the matrix. However, there is an easier and more efficient way: Use the LOC function, which finds the LOCation of elements that satisfy a logical expression.

This article provides examples of using the LOC function and a logical expression to find elements in a matrix that satisfy a criterion. It also points out that the SAS/IML language creates temporary 0/1 matrices from logical expressions.

The following SAS/IML statements define a 3 x 4 matrix. The LOC function is used to find the location of the elements that are missing, negative, and even, respectively:

proc iml;
x = {. -5 2  5,
    -2  . 3  4,
     4  . . -1};
missingLoc = loc(x=.);             /* missing values */
negativeLoc = loc(x^=. & x<0);     /* nonmissing and negative  */
evenLoc = loc(mod(x,2)=0);         /* n is even if (n mod 2)=0 */
print missingLoc, negativeLoc, evenLoc;

The output shows the location of the elements for each criterion. The elements of the matrix are enumerated in row-major order: the first row contains elements 1–4, the second row contains elements 5–8, and the third row contains elements 9–1. The program output shows that missing values are stored at locations 1, 6, 10 and 11. Negative values are stored at locations 2, 5, and 12. Even integers occur at locations 3, 5, 8, and 9.

Even if you are familiar with the LOC function, you might not be familiar with the way SAS/IML passes a logical expression like x=. or x<0 into the LOC function. When the SAS/IML parser sees an expression that involves a comparison operator, it evaluates the logical expression, which creates a temporary matrix of 0s and 1s. That temporary 0/1 matrix is sent to the LOC function. You can observe this behavior by using the RESET PRINTALL statement to display all matrices that SAS/IML creates, including the temporary matrices. If you add the statement

reset printall;

as the second line of the previous program, then the following output is displayed when the first LOC function executes:


The output shows that PROC IML creates a temporary matrix named _TEM1001. The matrix is the result of evaluating the logical expression x=.. Notice that the 0/1 matrix contains a 1 in the locations for which the data are missing and a 0 in the other locations. Similarly, the other logical expressions in the program create temporary matrices with names like _TEM1002 and _TEM1003.

There are two takeaways from this example:

  • Use the LOC function to find the location of matrix elements that satisfy a logical condition. The LOC function is more efficient than an explicit loop over the matrix elements.
  • The argument that is passed to the LOC function is always a matrix. When you specify a logical condition, SAS/IML creates temporary variables from the logical expressions. You can use the RESET PRINTALL statement to see the temporary variables that are created.

For more information, here are two related posts:

Post a Comment

Calling a global statement inside a loop

The other day I was creating some histograms inside a loop in PROC IML. It was difficult for me to determine which histogram was associated with which value of the looping variable. "No problem," I said. "I'll just use a TITLE statement inside the loop so that each histogram has a different title."

Easier said than done.

You see, the TITLE statement requires an argument that is a literal string value. Because the TITLE statement is a "global statement" in SAS and not a statement in the SAS/IML language, the argument cannot be a SAS/IML variable. In other words, the following attempt to set the title inside a loop is not correct:

ods graphics / width=400px;
proc iml;
x = j(50,1);
do mu = 0 to 1;
   call randgen(x, "Normal", mu);   /* x ~ N(mu, 1) */
   msg = "mu = " + strip(char(mu)); /* concatenate strings */
   title msg;                       /* try to form title (DOESN'T WORK!) */ 
   call histogram(x);

You might try to use macro variables to change the title within the loop. For example, you might be tempted to set create a macro variable inside the loop and write the title statement as

   title "mu = &mu";               /* use macro variable (DOESN'T WORK!) */

However, that attempt also fails, for reasons that I have described in a previous article about macro variables and SAS/IML loops.

The solution that I came up with requires using the CALL EXECUTE subroutine in SAS/IML. The EXECUTE subroutine enables you to execute any string that represents a valid SAS or SAS/IML statement. In particular, you can create the string dynamically by using string concatenation operations, and then create the TITLE statement in a similar way. You can use the CALL EXECUTE subroutine to run the TITLE statement for each iteration of the loop. Here's one way to do it:

/* module that executes the TITLE statement on an arbitrary string */
start SetTitle(msg);
   cmd = 'TITLE "' + msg + '";';    /* create TITLE stmt by concatenating */
   call execute(cmd);
x = j(50,1);
do mu = 0 to 1;
   call randgen(x, "Normal", mu);   /* x ~ N(mu, 1) */
   msg = "mu = " + strip(char(mu)); /* concatenate strings */
   run SetTitle(msg);
   call histogram(x);

There are a few subtle tricks in this code that are worth explaining:

  • In the SetTitle module, single quotes are used to form the string in the cmd variable. That is because the string itself contains double quotes.
  • The string in the cmd variable contains a semicolon at the end.
  • The EXECUTE subroutine is called from a module. Prior to SAS/IML 13.2, it was not possible to call the EXECUTE subroutine in the main scope of the program. This restriction has been lifted in recent versions of SAS/IML.

Obviously the same technique will use with other statements that require literal strings, including the FOOTNOTE and LIBNAME statements. However, you can use this technique to call SAS/IML statements as well. For example, you can call the MATTRIB statement within a loop to assign matrix attributes to multiple matrices that are enumerated in a list.

Post a Comment

Twelve posts from 2014 that deserve a second look

I began 2015 by compiling a list of popular articles from my blog in 2014. Although this "People's Choice" list contains many interesting articles, some of my favorites did not make the list. Today I present the "Editor's Choice" list of articles that deserve a second look. I've highlighted one article from each month of 2014.

I've grouped the articles into three broad topics: simulating data, statistical graphics, and matrix computations.

Simulating data and resampling

Statistical graphics

Matrix computations


Many of the previous articles use the SAS/IML language. In 2014 SAS made the SAS/IML language freely available. The IML procedure is included as part of the SAS University Edition, which is free for students, professors, researchers, and adult learners.

Two additional articles from 2014 are worth highlighting because of their importance to SAS/IML programmers:

There you have it, 12 topics that I think are worth a second look. What was your favorite article from The DO Loop in 2014? Leave a comment.

Post a Comment

Compute the kth smallest data value in SAS

A SAS programmer recently posted a question to the SAS/IML Support Community about how to compute the kth smallest value in a vector of numbers. In the DATA step, you can use the SMALLEST function to find the smallest value in an array, but there is no equivalent built-in function in the SAS/IML language. However, it is easy to write a short module that performs the computation.

If the data include missing values, there is an ambiguity in the question. Do you want the kth smallest value after omitting the missing values? Or do you want the kth smallest value in the ordered list where missing values appear before nonmissing values? This article considers both cases.

The smallest value in an array

The DATA step has built-in functions for both cases. The SMALLEST function returns the kth smallest nonmissing value. The ORDINAL function sorts the missing and nonmissing values and returns the kth element in the ordered list. These functions act on elements of a DATA step array. Consequently, if you want to find the kth element in a column of a data set, you need to transpose the data by using PROC TRANSPOSE.

The following example creates a data set with 10 observations, two of which are missing. The data set is transposed and the data are read into an array. The small3 variable contains the third smallest nonmissing value, ignoring the missing values. The ord3 variable contains the third smallest ordinal value, where missing values are ordered before nonmissing values. The small7 and ord7 variables contain the seventh smallest values, excluding and including the missing values, respectively.

/* Base SAS: use SMALLEST or ORDINAL with array to compute k_th smallest value */
data A;
input x @@;
. -5 -2 0 2 5 2 2 4 .
proc transpose data=A out=B;
data C;
   set B;
   array x[*] _NUMERIC_;
   ord3   = ordinal(3, of x[*]);      /* include missing values */
   ord7   = ordinal(7, of x[*]);
   small3 = smallest(3, of x[*]);     /* ignore missing values */
   small7 = smallest(7, of x[*]);
proc print noobs data=C;
   var ord3 ord7 small3 small7;

The smallest value in a vector

The SAS/IML language does not have built-in functions that are equivalent to the SMALLEST and ORDINAL functions. However, you can easily sort the data and extract the kth smallest value. The following statements define a function named ORDINAL that takes a SAS/IML vector as an argument:

proc iml;
start ordinal(k, _x);
   x = colvec(_x);       /* make sure it is a column vector */
   call sort(x);
   if k<1 | k>nrow(x) then
      return (.);        /* return missing if index out of range */
   return( x[k] );       /* value of the k_th ordered element */
/* test the module by using the example data */
use A;  read all var {"x"};  close A;
ord1 = ordinal(1, x);
ord3 = ordinal(3, x);
ord7 = ordinal(7, x);
print ord1 ord3 ord7;

You can see that the ORDINAL module returns the kth value of the data, including missing values. In fact, the argument k can be a vector. The statement ord = ordinal({1 3 7}, x) returns a vector that contains the 1st, 3rd, and 7th ordered elements.

If you want to exclude missing values, you can use the LOC function to locate nonmissing values and use the subscript operator to extract them. You can then call the ORDINAL module on the vector of nonmissing values, as shown below:

start smallest(k, _x);
   idx = loc(_x^=.);               /* find nonmissing values */
   if IsEmpty(idx) then return(.); /* all values are missing */
   x = _x[idx];                    /* extract nonmissing */
   return ( ordinal(k, x) );       /* call the previous module */
small1 = smallest(1, x);
small3 = smallest(3, x);
small7 = smallest(7, x);
print small1 small3 small7;

I leave it as an exercise to the reader to modify the definition of the ORDINAL module so that the argument k can be a vector. In other words, modify the function so that the statement s = smallest({1 3 7}, x) returns a vector with three elements.

If x is a matrix and you want the smallest elements of each column, just call the modules in a DO loop.

Do you ever use the SMALLEST or ORDINAL functions in the DATA step? For what purpose?

Post a Comment

Popular posts from The DO Loop in 2014

I published 118 blog posts in 2014. This article presents my most popular posts from 2014 and late 2013.

2014 will always be a special year for me because it was the year that the SAS University Edition was launched. The University Edition means that SAS/IML is available to all students and adult learners all over the world. My article "10 tips for learning the SAS/IML language" has been popular as new programmers take advantage of this free opportunity to learn the SAS/IML language for matrix computations and data analysis.

General math and statistics articles

Although I mostly write about statistical programming, several of my general articles on math and statistics were very popular:

Statistical Graphics and Data Visualization

Who doesn't like to learn better ways to visualize data in SAS? The following posts generated some interesting discussions.

Visualizing Matrices

An important visualization is to visualize the data in a rectangular matrix by using a heat map:

Regression and data smoothers

Everyone loves a good regression tip!

Start your new year by (re-)reading one of these 14 popular posts from 2014. Happy New Year to all my readers!

Post a Comment

Self-similar structures from Kronecker products

I recently posted an article about self-similar structures that arise in Pascal's triangle. Did you know that the Kronecker product (or direct product) can be used to create matrices that have self-similar structure?

The basic idea is to start with a 0/1 matrix and compute a sequence of direct products of the matrix with itself. If M is a matrix that contains only zeros and ones, then M@M is a block matrix where a block of zeros occurs where M has a 0, and a copy of M appears where M has a 1, as shown by the following SAS/IML program:

proc iml;
M = {0 1 0,
     1 1 1,
     0 1 0};
M2 = M @ M;
print M2;

In this example, the original matrix, M, is 3 x 3. The matrix M@M is a 3 x 3 block matrix, where each block is either the zero matrix or a copy of M. If you apply the Kronecker product again and form the matrix M@M@M, you will get a 9 x 9 block matrix, where each block is either the zero matrix or a copy of M.

You can continue this process. The k-fold product of M with itself is a 3k x 3k matrix. Arthur Charpentier provides a nice animation of this iterative process on his Freakonometrics blog, which is where I first read about using the Kronecker product of a 0/1 matrix to create a self-similar structure. If you like to read articles about mathematical aspects of probability and statistics, I highly recommend Charpentier's well-written and always interesting blog.

Printing a 0/1 matrix is not an effective visualization. Instead, I will create a two-color heat map by using the HEATMAPDISC subroutine in SAS/IML 13.1. Because I want to create several heat maps, I wrote the following SAS/IML module that composes a matrix with itself k times and draws a two-color heat map of the result. The following heat map shows the structure of the four-fold Kronecker product M@M@M@M:

start KroneckerPlot(M, Iters);
   N = 1;
   do i = 1 to Iters;
      N = N @ M;
   call heatmapdisc(N) title="Iteration: "+strip(char(Iters)) 
        colorramp={White Blue} displayoutlines=0 ShowLegend=0;
ods graphics / width=800px height=800px;
run KroneckerPlot(M, 4);

Let's say it all together: "Ooooooh!" "Ahhhhh!"

The structures that you can create depend on the arrangement of zeros and ones in the original matrix. Notice that these product matrices get big fast: the k-fold Kronecker product of an n x n matrix is an nk x nk matrix. You might run out of memory if you try to visualize a k-fold product when k or n is greater than 5. The next statements start with a 4 x 4 matrix and display a heat map for the five-fold product, which is a 1024 x 1024 matrix:

M = {1 1 1 1,
     1 1 0 0,
     1 0 1 0,
     1 0 0 1};
run KroneckerPlot(M, 5); /* 5 iterations of 4x4 matrix ==> 1024 x 1024 matrix */

I call this image "Landing at LaGuardia."

Obviously you can create other structures. You can even generate random structures by using the Bernoulli distribution to create a random 0/1 matrix. I leave you with a few additional statements that create self-similar images. Feel free to create your own matrices. Leave a comment and tell me your favorite matrix for creating a self-similar structure.

ods graphics / width=500px height=500px;
M = {1 1 1 1,
     1 0 0 1,
     1 0 0 1,
     1 1 1 1};
run KroneckerPlot(M, 4);
/* Make random fractal plots! */
call randseed(54321);
do i = 1 to 5;
   M = randfun({4 4}, "Bernoulli", 0.75);
   run KroneckerPlot(M, 3);
Post a Comment

Elementwise minimum and maximum operators

Like most programming languages, the SAS/IML language has many functions. However, the SAS/IML language also has quite a few operators. Operators can act on a matrix or on rows or columns of a matrix. They are less intuitive, but can be quite powerful because they enable you perform computations without writing a DO loop. This article describes the elementwise minimum and maximum operators.

Recently I wrote about how to use subscript operators to compute the min or max of a row or column of a data matrix. These min/max operators are two examples of subscript operators that perform row and column operations on a data matrix. Other subscript operators enable you to compute the sum, mean, and sum of squares for rows or columns of a matrix.

But what if you need to compare the values across two different matrices? To paraphrase a popular ad campaign, "there's an op for that!"

The elementwise minimum operator

Here's an example that illustrate how the elementwise minimum operator works. Suppose that there is a teacher who always gives a bonus question on her quizzes and tests. Quizzes are worth 50 points, tests are worth 100 points, and the bonus question is worth 5 points. However, the teacher does not allow any student to score more than 50 points on a quiz or more than 100 points on a test.

The following SAS/IML statements define the raw scores for three students in the class:

proc iml;
Test = {"Quiz1" "Test1" "Quiz2" "Test2"};
Name = {"Rita", "Sam", "Tim"};
x = {55 105 50 105,
     45  95 55  90,
     15 100 55 105};
print x[r=Name c=Test L="Raw Scores"];

You can see that on several occasions a student has answered all questions correctly, including the bonus question. The teacher wants to cap the scores at some maximum value. The elementwise minimum operator (><) is an easy way to perform this operation. To ensure that 100 is the maximum possible score, the teacher could apply the >< operator, just as in the SAS DATA step:

trunc100 = x >< 100;             /* result is minimum of x[i,j] and 100 */
print trunc100[r=Name c=Test L="Max Score = 100"];

Now the maximum value of any cell is 100. However, columns 1 and 3 represent quiz scores, so they need to be truncated at the maximum value of 50. You might be tempted to loop over the columns and use the CHOOSE function to cap the maximum value of each column, as follows:

target = {50 100 50 100};    /* target[i] is max allowed value of column i */
A = j(nrow(x), ncol(x));     /* allocate new matrix for results */
do j = 1 to 4;
   A[,j] = choose(x[,j] > target[j], target[j], x[,j]);

However, the elementwise minimum operator enables you to compute the result without writing a loop. Because the target vector is a row vector with four columns, the following statement returns a matrix where element (i,j) is the minimum of A[i,j] and target[j]:

A = (x >< target);   /* target[j] is max allowed value of column j */
print A[r=Name c=Test L="Adjusted Scores"];

The elementwise maximum operator

In a similar way, the elementwise maximum operator (<>) enables the teacher to set the smallest possible value for a test score. The teacher might decide that extremely low scores are unduly influential when computing the average grade, so she might decide to make 25 the lowest possible score, as follows:

B = A <> 25;             /* result is maximum of A[i,j] and 25 */

The elementwise minimum and maximum operators work when the second matrix is a scalar, a row vector, a column vector, or a matrix.

  • The expression (A >< scalar) returns a matrix that is the same size of A, and no element is smaller than scalar.
  • The expression (A >< row_vector) returns a matrix that is the same size of A, and no element of the jth column is smaller than the jth element of row_vector.
  • The expression (A >< col_vector) returns a matrix that is the same size of A, and no element of the ith row is smaller than the ith element of col_vector.
  • The expression (A >< matrix) returns a matrix that is the same size of A, and the (i,j)th element is no smaller than the (i,j)th element of matrix.

Truncating vector values

The elementwise min/max operators can be used to truncate a vector of values. This often comes up when data has small negative values (possibly because of numerical round off) and you want to truncate all negative values to 0.

Another example is when you want to truncate the values of a function. For example, the sine function returns values in the range [–1, 1]. If you want to truncate the sine function at ±1/2, you can use the following shorthand notation:

ods graphics / width=400px height=200px;
t = do(0,12.56,0.03);
z = (-0.5 <> sin(t) >< 0.5); /* truncate within [-0.5, 0.5] */
title "Truncated Sine Function";
call series(t, z);
Post a Comment