On the determinant of the Hilbert matrix

Last week I described the Hilbert matrix of size n, which is a famous square matrix in numerical linear algebra. It is famous partially because its inverse and its determinant have explicit formulas (that is, we know them exactly), but mainly because the matrix is ill-conditioned for moderate values of n. Consequently, a small Hilbert matrix (often n < 10) can be used to test numerical algorithms to see how they perform on an ill-conditioned or nearly singular matrix.

The Hilbert matrix can teach us a lesson about how to compute with large numbers. The determinant of the Hilbert matrix has an explicit formula that involves the product of factorials. If you define the function
c(n) = 1! · 2! · ... · (n – 1)!
then the determinant of the Hilbert matrix of size n is
det(Hn) = c(n)4 / c(2n)
This ratio goes to zero very quickly as n increases. Yes, the numerator is large, but the denominator is even larger, as shown by the following SAS/IML program:

proc iml;
start c(n);
   return( prod(fact(1:n-1)) );   /* 1!*2!*...*(n-1)! */
finish;
 
maxN = 8;
numer = j(maxN,1); denom = j(maxN,1); det = j(maxN,1);
do n = 1 to maxN;
   numer[n] = c(n)##4;  denom[n] = c(2*n);
   det[n] = numer[n] / denom[n];
end;
print (T(1:maxN))[L="n"] numer denom det;
t_hilbertdet

You can see that the determinant of the Hilbert matrix is practically zero for n > 5, so most numerical algorithms would conclude that the matrix is singular. However, in exact arithmetic, the determinant of the Hilbert matrix is the reciprocal of a large integer, so it is always positive.

How to numerically compute with super-factorial quantities?

The Hilbert matrix illustrates two common problems that arise when computing with huge numbers. First, the function c(n) results in a numerical overflow when n > 27, so we cannot use that function to compute the determinant for more than a 13 x 13 Hilbert matrix. Second, the determinant computation involves the ratio of huge numbers, so the accuracy of the determinant is questionable.

A standard technique for computing with such huge numbers is to transform the quantities, usually by taking logarithms. If we can compute the log-determinant, then we "know" the value of the actual determinant even when we cannot compute the determinant with double-precision arithmetic.

I have previously shown how to numerically compute the log-determinant of a general matrix. However, for the Hilbert matrix the determinant is known exactly, so we can compute the exact log-determinant. The formula for the log-determinant is log(det(Hn)) = 4*log(c(n)) – log(c(2n)).

Although c(n) involves products of factorials, log(c(n)) only involves the sums of logs of integers. Why is that? Well, since the log of a product is the sum of the logs, the quantity log(k!) can be computed in the SAS/IML language as sum(log(1:k)). Likewise, the product of factorials reduces to a sum of log-factorials, as shown in the following SAS/IML function for the quantity log(c(n)):

/* compute log(c(n)) */
start logc(n);
   s = 0;           /* log(1!) */
   do i = 2 to n-1;
      s = s + sum(log(1:i));
   end;
   return( s );
finish;
 
/* compute log(c(n)) and the log-determinant of the Hilbert matrix */
logcn = j(maxN,1); logdet = j(maxN,1);
do n = 1 to maxN;
   logcn[n] = logc(n);
   logdet[n] = 4*logc(n) - logc(2*n);
end;
print (T(1:maxN))[L="n"] logcn[L="logc(n)"] logdet[L="log(det(H(n)))"];
t_hilbertdet2

The difference is amazing. Whereas the quantity c(8) is huge and the quantity det(H8)) is tiny, the logs of these quantities are reasonable sizes. Although c(100) is astronomically huge, you can easily compute the log of c(100) and the log of det(H100)).

You might never have a need to compute the determinant of a Hilbert matrix, but the ideas in this article apply to many similar computational situations. For example, in maximum likelihood estimate, statistical software usually computes the log-likelihood function rather than the likelihood function. So next time that you find yourself computing a quantity that involve huge numbers (or tiny numbers), consider computing those quantities on a logarithmic scale. It will make a huge difference.

Post a Comment

Summary of new features in SAS/IML 12.1

I enjoy blogging about new functionality in the SAS/IML language because I can go into more depth and provide more complicated examples than the SAS/IML documentation. Today's article is a summary of all of my posts about features that were added to SAS/IML 12.1, which shipped in August 2012 as part of SAS 9.3m2. For complete details of all 12.1 features, see the documentation chapter "What's New in SAS/IML 12.1."

New SAS/IML functions and subroutines

I have blogged about the following functions that were added in SAS/IML 12.1:

The only SAS/IML 12.1 function that I have not blogged about is the DIMENSION function, which returns a 1 x 2 vector whose elements are the number of rows and columns, respectively, of a matrix. In other words, dimension(A) is equivalent to the expression (nrow(A) || ncol(A)). There! Now I've blogged about all SAS/IML 12.1 functions!

Enhancements to the SAS/IML syntax

There were several enhancements and language improvements in SAS/IML 12.1. Here are the ones that I have blogged about:

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be a useful to SAS/IML users who are upgrading their version of SAS.

Post a Comment

How to format decimals as fractions in SAS

Yesterday I blogged about the Hilbert matrix. The (i,j)th element of the Hilbert matrix has the value 1 / (i+j-1), which is the reciprocal of an integer.

However, the printed Hilbert matrix did not look exactly like the formula because the elements print as finite-precision decimals. For example, the last column of the matrix of size 5 is {0.2, 0.1666667, 0.1428571, 0.125, 0.1111111}. A colleague jokingly asked, "shouldn't the matrix contain fractions like 1/5, 1/6, 1/7, 1/8, and 1/9?"

To his surprise, I responded that SAS can actually print the matrix elements as fractions! SAS contains the FRACTw. format, which makes it easy to print decimals as their fractional equivalent in reduced form. Here is yesterday's matrix, printed as fractions:

print H[format=FRACT.];

I sometimes marvel at the variety of formats that are available in SAS software. From printing integers as Roman numerals to printing decimals as fractions, it seems like SAS has a format for all occasions.

What is your favorite SAS format? Why? Leave a comment.

Post a Comment

The Hilbert matrix: A vectorized construction

The Hilbert matrix is the most famous ill-conditioned matrix in numerical linear algebra. It is often used in matrix computations to illustrate problems that arise when you compute with ill-conditioned matrices. The Hilbert matrix is symmetric and positive definite, properties that are often associated with "nice" and "tame" matrices.

The Hilbert matrix is definitely not tame, but it is nice in the sense that it is easy to construct. If H is a Hilbert matrix, then H[i,j] = 1 / (i+j-1). Notice that the elements are the reciprocals of integers and the matrix is banded along the diagonals where i+j is constant.

I have previously blogged about three ways to vectorize the construction of a structured matrix. The Hilbert matrix provides another example to practice using vector and matrix computations, rather than write DO loops in the SAS/IML language. Let's look at three ways to construct the Hilbert matrix.

Method 1: Nested DO loops

The straightforward way to construct a Hilbert matrix is by writing nested iterative loops. This naive approach is not very efficient in a matrix-vector language such as SAS/IML, MATLAB, or R.

proc iml;
/* First attempt: nested DO loops. Inefficient */
n=5;
H = j(n,n);
do i = 1 to n;
   do j = i to n;
      H[i,j] = 1 / (i+j-1);
      H[j,i] = H[i,j];
   end;
end;
print H;

In the preceding code, the loops take advantage of the symmetry of the Hilbert matrix. However, the method is inefficient because it loops over all rows and columns.

Method 2: Constructing each row as a vector

In my previous article, I proposed identifying constant expressions within a loop and replacing the inner loop with a vector operation. That technique works with the Hilbert matrix. Notice that the expression j can be a vector that does not depend on the looping variable i. Consequently, you can rewrite the construction in the SAS/IML language as follows:

/* Second attempt: One DO loop over rows, column numbers stored in a vector */
H = j(n,n);
j = 1:n;
do i = 1 to n;
   H[i, ] = 1 / (i+j-1);
end;

Method 3: Use matrix operations

There are two kinds of arithmetic operations in linear algebra: matrix multiplication and elementwise operations. If you use elementwise division, the construction of the Hilbert matrix becomes a one-liner. However, for clarity I will use several intermediate matrices. You can use the ROW function in SAS/IML 12.3 software to construct the matrix that has the constant value i for every element of the ith row. Similarly, you can use the COL function to construct the matrix that is constant along columns. (If you do not have SAS/IML 12.3 or later, use the functions from a previous blog on structured matrices.)

For convenience, this final algorithm is written as a callable function, and I avoid using the ROW and COL functions in case some readers are running older versions of SAS/IML software. In the following function, matrices are placed in the denominator of a fraction and elementwise division is used to produce the matrix of reciprocals:

/* Create an n x n Hilbert matrix. No DO loops. */
start Hilbert(n);
   i = repeat(T(1:n), 1, n); /* = row(H) */
   j = repeat(1:n, n);       /* = col(H) */
   return( 1 / (i + j - 1) );
finish;
 
H = Hilbert(5);

There you have it: A vectorized construction of the n x n Hilbert matrix. It is always good to practice vectorizing computations, and this exercise shows how to efficiently construct one of the most famous ill-conditioned matrices in numerical linear algebra.

Post a Comment

Construct a stacked bar chart in SAS where each bar equals 100%

I enjoy reading the Graphically Speaking blog because it teaches me a lot about ODS statistical graphics, especially features of the SGPLOT procedure and the Graph Template Language (GTL). Yesterday Sanjay blogged about how to construct a stacked bar chart of percentages so that each bar represents 100%. His chart had the additional feature of displaying the percentages for each category. His post showed how to use the SGPLOT procedure to duplicates the functionality of the G100 option in the GCHART procedure.

In this blog post I present an alternative to Sanjay's construction. When I construct a stacked bar chart, I use PROC FREQ to compute the percentages of each category within a group. I then use the VBAR or HBAR statements in the SGPLOT procedure to construct the stacked bar chart. For example, the bar chart at the top of this post is constructed by using the following statements:

proc sort data=sashelp.cars(where=(Type^="Hybrid")) out=cars;
by Origin;                     /* sort X categories */
run;
 
proc freq data=cars noprint;
by Origin;                    /* X categories on BY statement */
tables Type / out=FreqOut;    /* Y (stacked groups) on TABLES statement */
run;
 
title "100% Stacked Bar Chart";
proc sgplot data=FreqOut;
vbar Origin / response=Percent group=Type groupdisplay=stack;
xaxis discreteorder=data;
yaxis grid values=(0 to 100 by 10) label="Percentage of Total with Group";
run;

The VBAR statement in the SGPLOT procedure creates the stacked bar chart. Each bar totals 100% because the Percent variable from PROC FREQ is used as the argument to the RESPONSE= option. The GROUPDISPLAY=STACK option stacks the groups so that there is one bar per category.

Creating stacked bars ordered by percentages

A variation of the 100% stacked bar chart is to order the "slices" of the bars by their relative sizes. This is shown at the left. I used the ORDER= option on the PROC FREQ statement to output the counts in descending order for each bar. If I use the GROUPORDER=DATA option on the VBAR statement in PROC SGPLOT, the categories will be arranged and colored according to how they appear in the data. In particular, the first bar will be ordered by relative percentages, as shown by the "Asia" bar. Notice, however, that this does not guarantee that other bars are ordered by size, as seen by the "Europe" bar.

Still, I like to order the groups by size because it makes it easier to find the relative percentages of the most important groups. In general, I prefer to order charts by some quantity, rather than to use the default alphabetical ordering.

The following statements produce the 100% stacked bar chart with ordered groups, assuming that the data set is already sorted according to the X variable:

proc freq data=cars order=freq noprint;  /* ORDER= sorts by counts within X */
by Origin;                               /* X var */
tables Type / out=FreqOutSorted;         /* Y var */
run;
 
proc print data=FreqOutSorted; run;
 
title "100% Stacked Bar Chart Ordered by Percentages";
proc sgplot data=FreqOutSorted;
vbar Origin / response=Percent group=Type 
              grouporder=data groupdisplay=stack; /* order by counts of 1st bar */
xaxis discreteorder=data;
yaxis grid values=(0 to 100 by 10) label="Percentage of Total with Group";
run;

PROC FREQ does it, too! (SAS 9.4m1)

Over the last few releases there have been quite a few new ODS graphs that come out of PROC FREQ "for free." If you have SAS 9.4m1, you can use the SCALE=GROUPPERCENT option to create a stacked bar chart similar to the one in the previous section:

proc freq data=cars order=freq;
tables Type*Origin / plots=freqplot(twoway=stacked scale=grouppct); 
run;

The lesson to learn is that you can use relatively simple statements (my post) to produce basic stacked bar charts or more sophisticated SAS code (Sanjay's post) to produce more complicated bar charts. Choose the method that you prefer, and let SAS do the rest.

Post a Comment

Vector and matrix norms in SAS

Did you know that SAS/IML 12.1 provides built-in functions that compute the norm of a vector or matrix? A vector norm enables you to compute the length of a vector or the distance between two vectors in SAS. Matrix norms are used in numerical linear algebra to estimate the condition number of a matrix, which is important in understanding the accuracy of numerical computations.

Vector norms are probably familiar to many readers. The vector norms that are used most often in practice are as follows:

  • The L1 or "Manhattan" norm, which is the sum of the absolute values of the elements in a vector
  • The L2 or "Euclidean" norm, which is the square root of the sum of the squares of the elements
  • The L or "Chebyshev" norm, which is the maximum of the absolute values of the elements

The following program demonstrates these vector norms:

proc iml;
u = {-2, 4, -2, 1};
Length = norm(u);     /* Euclidean norm = sqrt(ssq(u)) */
L1 = norm(u, "L1");           /* L1 norm = sum(abs(u)) */
Linf = norm(u, "Linf");     /* infinity norm = max(u)  */
print Length L1 Linf;

Matrix norms might be less familiar to some readers. A matrix norm tells you how much a matrix can stretch a vector. For a matrix A, the matrix norm of A is the maximum value of the ratio || Av || / || v ||, where v is any nonzero vector. The matrix norm of A is denoted || A || and depends on (or "subordinate to") a vector norm. The matrix norms that are used most often in practice are as follows:

  • The matrix 1-norm, which is the maximum over the sum of the absolute values of each column
  • The Frobenius norm, which is the square root of the sum of the squares of the elements.
  • The matrix 2-norm, which is the largest singular value of the matrix. The 2-norm is also called the spectral norm.
  • The matrix ∞-norm, which is the maximum over the sum of the absolute values of each row

Both the Frobenius norm and the matrix 2-norm are subordinate to the vector 2-norm. It turns out that the Frobenius norm is equal to the square root of the sum of squares of the singular values of a matrix. Thus it is similar to the matrix 2-norm: the 2-norm is the largest singular value, whereas the Frobenius norm involves all of the singular values.

Here are some simple SAS/IML computations for the matrix that appears in the Wikipedia article on matrix norm:

A = {3 5 7,
     2 6 4,
     0 2 8};
mL1 = norm(A, "L1");                      /* = max(A[+,])  */
mFrob = norm(A, "Frobenius");             /* = sqrt(A[##]) */
mL2 = norm(A, "L2");           /* = largest singular value */
mLinf = norm(A, "Linf");                  /* = max(A[,+])  */
print mL1 mFrob mL2 mLinf;

As I said, matrix norms give a measure of how much a nonzero vector can be stretched when transformed by a matrix. A matrix with a unit norm does not increase the length of vectors. A matrix with a large norm is called an ill-conditioned matrix. An ill-conditioned matrix can take a unit-length vector and stretch it by a large amount. Thus small uncertainties in the domain vector get magnified and lead to large uncertainties in the range. Consequently, matrix norms are used in numerical linear algebra to study how errors propagate during matrix computations.

Post a Comment

A different way to interpret the negative binomial distribution

While at SAS Global Forum 2014 I attended a talk by Jorge G. Morel on the analysis of data with overdispersion. (His slides are available, along with a video of his presentation.) The Wikipedia defines overdispersion as "greater variability than expected from a simple model." For count data, the "simple model" is the Poisson model, for which the variance equals the mean. Morel pointed out that the variance of count data is often greater than is permitted under a Poisson model, so more complicated models must be used.

One model that is used is the negative binomial model, for which the variance is greater than the mean. One of Morel's slides (Slide 43) mentions that the Poisson and NB distributions are closely related: you can view the negative binomial as a Poisson(λ) distribution, where λ is a gamma-distributed random variable. A distribution that has a random parameter is called a compound distribution.

In school I learned that the negative binomial is the distribution of the number of failures before k successes in a sequence of independent Bernoulli trials, which doesn't seem to be related to a Poisson distribution. To better understand Morel's talk, I wrote a SAS/IML program that simulates a random variable drawn from a Poisson(λ) distribution, where λ is drawn from a gamma distribution. The program shows off a feature of SAS/IML 12.1, namely that you can pass vectors of parameters as arguments to the RANDGEN subroutine. Here is the program:

/* Simulate X from a compound distribution: 
   X ~ Poisson(lambda), where lambda ~ gamma(a,b).
  Parameters from Morel's talk at SASGF14.
  E(X) = mu = a*b = 10; Var(X) = mu*(1+mu/a) which is greater than mu */
proc iml;
N = 1000; a = 5; b = 2;
call randseed(12345);
lambda = j(N,1); 
call randgen(lambda, "Gamma", a, b); /* lambda ~ Gamma(a,b) */
x = j(N,1);
call randgen(x, "Poisson", lambda);  /* 12.1 feature: pass vector of parameters */

The distribution of these simulated data agrees with the theoretical NB distribution (Slide 44 of Morel's presentation). In the following graph, the bar chart is the empirical distribution of the simulated data, whereas the circles represent the NB probability density. The graph was created by using the techniques in Chapter 3 of my book Simulating Data with SAS.

Upon doing further reading I learned that the compound distribution definition of the NB is favored by the insurance industry. Each policy holder has a different risk of getting in an accident, so the rate of claims (the λ parameter) is treated as random. The gamma distribution is a flexible way to model the distribution of risks in the population.

Of course, SAS enables you to sample directly from the negative binomial distribution, but that requires the traditional parameterization in terms of failures and the probability of success in a Bernoulli trial. This formulation in terms of a compound distribution enables you to use parameters that are more interpretable in certain applications.

Post a Comment

SAS/IML available to all students through SAS Analytics U

When spontaneous applause broke out during Dr. Jim Goodnight's presentation at the opening session of SAS Global Forum 2014, I was one of the people cheering the loudest.

The SAS CEO had just announced free software for students and professors at universities around the world. The SAS University Edition will provide access to several core SAS products, including Base SAS, SAS/STAT, SAS/ACCESS for PC File Formats, and SAS/IML software.

This academic program is called SAS Analytics U. Not only does it include free access to software and a programming environment that runs in Web browsers, but SAS has also set up a virtual community for the academic users, called the SAS Analytics U Online Community.

Why did I cheer so loudly?

  • I am passionate about education, and the SAS University Edition will help students learn mathematical and computational skills for analyzing data.
  • I am passionate about online communities that bring SAS users together. The SAS Analytics U Online Community provides a site where faculty and students can interact, share best practices, and ask and answer programming questions.
  • I am passionate about SAS/IML software. Students can use matrix computations to implement formulas and algorithms that are found in textbooks. Professors and advanced students will be able to compute statistics that extend the capabilities of SAS software.

There is an old proverb:

Tell me, I’ll forget.
Show me, I’ll remember.
Involve me, I’ll understand.

At SAS Global Forum 2014, Dr. Goodnight told people about SAS Analytics U and the availability of SAS/IML software. In my blog, I show people how to program in the SAS/IML language. The SAS University Edition will involve students in learning data analysis and programming.

As the proverb says, involve them and they will understand. Well done, SAS! Here comes the next generation of SAS/IML programmers.

Post a Comment

Color scatter plot markers by values of a continuous variable in SAS

When I visualize three-dimensional data, I prefer to use interactive graphics. For example, I often use the rotating plot in SAS/IML Studio (shown at the left) to create a three-dimensional scatter plot. The interactive plot enables me to rotate the cloud of points and to use a pointer to select and query the values of interesting points.

However, in blog posts, conference proceedings, and slideshow presentations, I often need to display a static visualization of three-dimensional data. Of course I can display a static snapshot of the rotating plot, as I've done here, but there are other options, including using the G3D procedure in SAS/GRAPH software to create a static 3-D scatter plot of the data.

A third option is to draw a two-dimensional scatter plot and color the observations by the value of a third variable. This is a useful technique in many situations, such as visualizing the relationship between two variables while indicating the value of a third variable. The following 2-D scatter plot shows the same data as in the 3-D rotating plot at the top of this article:

The data are from the documentation for the GAM procedure in SAS/STAT software and depict an experiment in which the yield of a chemical reaction is plotted against two control variables. The temperature of the solution and the amount of catalyst added to the solution were both varied systematically and independently on a uniform grid of values. From this scatter plot you can quickly see that the yield tends to be high when the temperature is in the 120–103 range and the amount of catalyst is between 0.04 and 0.07.

Coloring markers by a continuous variable

It is easy to color markers according to the value of a discrete variable: use the GROUP= option on the SCATTER statement in PROC SGPLOT. But how can you create the previous scatter plot by using the SG procedures in SAS?

As of SAS 9.4, the SGPLOT procedure does not enable you to assign colors to markers based on a continuous variable. However, you can use the Graph Template Language (GTL) to create a template that creates the plot. The trick is to use the MARKERCOLORGRADIENT= and COLORMODEL= options on the SCATTERPLOT statement to associate colors with values of a continuous variable. The following template creates a scatter plot with markers that are colored according to a blue-red color ramp:

/* create a GTL template that displays a scatter plot with markers 
   colored according to values of a continuous variable */
proc template;
  define statgraph gradientplot;
  dynamic _X _Y _Z _T;
  mvar LEGENDTITLE "optional title for legend";
    begingraph; 
      entrytitle _T; 
      layout overlay; 	 
        scatterplot x=_X y=_Y / 
          markercolorgradient=_Z colormodel=(BLUE RED)
          markerattrs=(symbol=SquareFilled size=12) name="scatter";
        continuouslegend "scatter" / title=LEGENDTITLE;
      endlayout;	
    endgraph;
  end;
run;
 
%let LegendTitle = "Yield";
proc sgrender data=ExperimentA template=gradientplot;
   dynamic _X='Temperature' _Y='Catalyst' _Z='Yield' _T='Raw Data';
run;

A few comments on the GTL template:

  • The MVAR statement enables you to use macro variables in your graphs. When the SGRENDER procedure is called, the legend title will be set to the value of the LegendTitle macro, if the variable is defined.
  • The three variables in the graph are dynamic variables (_X, _Y, and _Z) that are specified when you call PROC SGRENDER. The title of the graph (_T ) is similarly specified.
  • The MARKERCOLORGRADIENT= option is used to assign marker colors according to values of the _Z variable.
  • The COLORMODEL= option is used to specify a color ramp. I've hard-coded a blue-red color ramp, but other options are possible.
  • The CONTINUOUSLEGEND statement is used to display the color ramp on the graph so that the reader can associate colors to values.

Tip: The plot will suffer from overplotting if there are two or more observations that have the same (x, y) coordinates but different z coordinates. You can still use this technique, but you might want to sort the data by the response variable. This will create a plot where the high values of the response variable are apparent because they are plotted on top of the lower values. For example, if the purpose of your plot is to demonstrate that light cars with small engines are more fuel efficient than larger vehicles, sort the Sashelp.Cars data set by the MPG_City variable before you create the scatter plot, as follows:

proc sort data=Sashelp.Cars out=Cars;   by MPG_City;  run;
 
%let LegendTitle = "MPG City";
proc sgrender data=Cars template=gradientplot;
   dynamic _X='Horsepower' _Y='Weight' _Z='MPG_City' _T='Fuel Efficiency';
run;

You can download the data and the program that creates these scatter plots. This GTL template is easily modified to support other color ramps, transparent markers, and othe options. Enjoy!

Post a Comment

Video: What's new in SAS/IML 13.1

SAS/IML 13.1 shipped a few months ago. I was asked to produce a video that highlights some of the new features in SAS/IML 13.1. In this video I describe several changes to the language before introducing the new built-in subroutines that create ODS statistical graphs.



If your browser does not support embedded video, you can go directly to the video on YouTube. For more details, see "What's New in SAS/IML 13.1" in the SAS/IML User's Guide.

By the way, did you know that there is a SAS Statistics and Operations channel on YouTube that contains dozens of videos? Many of the videos are of presentations given by SAS staff at conferences such as SAS Global Forum. Enjoy!

Post a Comment