Compute the square root matrix

Children in primary school learn that every positive number has a real square root. The number x is a square root of s, if x2 = s.

Did you know that matrices can also have square roots? For certain matrices S, you can find another matrix X such that X*X = S. To give a very simple example, if S = a*I is a multiple of the identity matrix and a > 0, then X = ±sqrt(a)*I is a square root matrix.

A matrix square root

I'm going to restrict this article to real numbers, so from now on when I say "a number" I mean a real number and when I say "a matrix" I mean a real matrix. Negative numbers do not have square roots, so it is not surprising that not every matrix has a square root. For example, the negative identity matrix (-I) does not have a square root matrix.

All positive numbers have square roots, and mathematicians, who love to generalize everything, have defined a class of matrices with properties that are reminiscent of positive numbers. They are called positive definite matrices, and they arise often in statistics because every covariance and correlation matrix is symmetric and positive definite (SPD).

It turns out that if S is a symmetric positive definite matrix, then there exists a unique SPD matrix X, (called the square root matrix) such that X2 = A. For a proof, see Golub and van Loan, 3rd edition, 1996, p. 149. Furthermore, the following iterative algorithm converges quadratically to the square root matrix (ibid, p. 571):

  1. Let X0=I be the first approximation.
  2. Apply the formula X1 = (X0 + S*X0-1) / 2 where X-1 is the inverse matrix of X. The matrix X1 is a better approximation to sqrt(S) than X0.
  3. Apply the formula Xn+1 = (Xn + S*Xn-1) / 2 iteratively until the process converges. Convergence is achieved when a matrix norm ∥ Xn+1 – Xn ∥ is as small as you like.

The astute reader will recognize this algorithm as the matrix version of the Babylonian method for computing a square root. As I explained last week, this iterative method implements Newton's method to find the roots of the (matrix) function f(X) = X*X - S.

Compute a matrix square root in SAS

In SAS, the SAS/IML matrix language is used to carry out matrix computations. To illustrate the square root algorithm, let S be the 7x7 Toeplitz matrix that is generated by the vector {4 3 2 1 0 -1 -2}. I have previously shown that this Toeplitz matrix (and others of this general form) are SPD. The following SAS/IML program implements the iterative procedure for computing the square root of an SPD matrix:

proc iml;
/* Given an SPD matrix S, this function to compute the square root matrix
   X such that X*X = S */
start sqrtm(S, maxIters=100, epsilon=1e-6);
   X = I(nrow(S));             /* initial starting matrix */
   do iter = 1 to maxIters while( norm(X*X - S) > epsilon );
      X = 0.5*(X + S*inv(X));  /* Newton's method converges to square root of S */
   end;
   if norm(X*X - S) <= epsilon then  return X;
   else  return(.);
finish;
 
S = toeplitz(4:-2);            /* 7x7 SPD example */
X = sqrtm(S);                  /* square root matrix */
print X[L="sqrtm(S)" format=7.4];
sqrtMatrix

The output shows the square root matrix. If you multiply this matrix with itself, you get the original Toeplitz matrix. Notice that the original matrix and the square root matrix can contain negative elements, which shows that "positive definite" is different from "has all positive entries."

Functions of matrices

The square root algorithm can be thought of as a mapping that takes an SPD matrix and produces the square root matrix. Therefore it is an example of a function of a matrix. Nick Higham (2008) has written a book Functions of Matrices: Theory and Computation, which contains many other functions of matrices (exp(), log(), cos(), sin(),....) and how to compute the functions accurately. SAS/IML supports the EXPMATTRIX function, which computes the exponential function of a matrix.

The square root algorithm in this post is a simplified version of a more robust algorithm that has better numerical properties. Higham (1986), "Newton's Method for the Matrix Square Root" cautions that Newton's iteration can be numerically unstable for certain matrices. (For details, see p. 543, Eqns 3.11 and 3.12.) Higham suggests an alternate (but similar) routine (p. 544) that is only slightly more expensive but has improved stability properties.

I think it is very cool that the ancient Babylonian algorithm for computing square roots of numbers can be generalized to compute the square root of a matrix. However, notice that there is an interesting difference. In the Babylonian algorithm, you are permitted to choose any positive number to begin the square root algorithm. For matrices, the initial guess must commute with the matrix S (Higham, 1986, p. 540). Therefore a multiple of the identity matrix is the safest choice for an initial guess.

Post a Comment

How to fit a variety of logistic regression models in SAS

SAS software can fit many different kinds of regression models. In fact a common question on the SAS Support Communities is "how do I fit a <name> regression model in SAS?" And within that category, the most frequent questions involve how to fit various logistic regression models in SAS.

There are many types of logistic regression models:

  • The traditional logistic model has a binary (or binomial) response variable.
  • Well-known extensions of the logistic model include ordinal regression (for an ordinal response) and multinomial regression (for a discrete but unordered response).
  • Special models handle situations such as repeated measures (longitudinal data) or random effects.
  • Other models include conditional logistic regression, survey logistic regression, Bayesian logistic regression, and fractional logistic regression (which gets the Coolest Name Award).

If you know the procedure that you want to use, you can read the procedure documentation and follow the examples to perform the analyses. In practice, however, you might know the reverse information: You know the analysis that you want to perform, but you do not know which SAS procedure implements that regression model.

I was therefore pleased to discover a short SAS Knowledge Base article titled "Types of logistic (or logit) models that can be fit using SAS." This article provides a "reverse look-up": Given the type of logistic regression model that you want to fit, it provides the name of the SAS procedures that support that analysis!

Use this resource the next time you need to know which SAS procedure can conduct a certain variant of logistic regression. Bookmark it, print it out, tattoo it on your forearm, or come to my blog and type "HOW TO FIT A LOGISTIC MODEL" into the search box. But don't forget it! This resource is a treasure for the SAS statistical programmer.

Post a Comment

All I really need to know about Newton's method I learned in primary school

I was eleven years old when I first saw Newton's method.

No, I didn't go to a school for geniuses. I didn't even know it was Newton's method until decades later. However, in sixth grade I learned an iterative algorithm that taught me (almost) everything I need to know about Newton's method for finding the roots of functions.

Newton's method and an iterative square root algorithm

The algorithm I learned in the sixth grade is an iterative procedure for computing square roots by hand. It seems like magic because it estimates a square root from an arbitrary guess. Given a positive number S, you can guess any positive number x0 and then apply the "magic formula" xn+1 = (xn + S / xn)/2 until the iterates converge to the square root of S. For details, see my article about the Babylonian algorithm.

It turns out that this Babylonian square-root algorithm is a special case of Newton's general method for finding the roots of functions. To see this, define the function f(x) = x2 - S. Notice that the square root of S is the positive root of f. Newton's method says that you can find roots by forming the function Q(x) = x - f(x) / f′(x), where f′ is the derivative of f, which is 2x. With a little bit of algebra, you can show that Q(x) = (x + S / x)/2, which is exactly the "magic formula" in the Babylonian algorithm.

Therefore, the square root algorithm is exactly Newton's method applied to a special quadratic polynomial whose root is the square root of S. The Newton iteration process is visualized by the following graph (click to enlarge), which shows the iteration history of the initial guess 10 when S = 20.

Iteration of initial guess under Babylonian square-root algorithm

Everything I need to know about Newton's method...

It turns out that almost everything I need to know about Newton's method, I learned in sixth grade. The Babylonian method for square roots provides practical tips about how to use Newton's method for finding roots:

  • You need to make an initial guess.
  • If you make a good initial guess, the iteration process is shorter.
  • When you get close to a solution, the number of correct digits doubles for each iteration. This is quadratic convergence in action!
  • Avoid inputs where the function is not defined. For the Babylonian method, you can't plug in x=0. In Newton's method, you need to avoid inputs where the derivative is close to zero.
  • Different initial guesses might iterate to different roots. In the square root algorithm, a negative initial guess converges to the negative square root.

In fact, I'll go further. The Babylonian method teaches important lessons in numerical analysis:

  • Root-finding is a fundamental tool that helps you solve all kinds of numerical problems.
  • Many problems that do not have exact answers can be solved with iterative methods.
  • Iterative methods often linearize the problem and use the linear formulation as part of the algorithm.

So you see, all I really needed to know about Newton's method I learned in sixth grade!

Post a Comment

The Babylonian method for finding square roots by hand

When I was in the sixth grade, I learned an iterative procedure for computing square roots by hand. Yes, I said by hand. Scientific calculators with a square root key were not yet widely available, so I and previous generations of children suffered through learning to calculate square roots by hand.

I still remember being amazed when I first saw the iterative square root algorithm. It was the first time that I thought math is magical. The algorithm required that I make an initial guess for the square root. I then applied a "magic formula" a few times. The magic formula improved my guess and estimated the square root that I sought.

The Babylonian square-root algorithm

The iterative method is called the Babylonian method for finding square roots, or sometimes Hero's method. It was known to the ancient Babylonians (1500 BC) and Greeks (100 AD) long before Newton invented his general procedure.

Here's how it works. Suppose you are given any positive number S. To find the square root of S, do the following:

  1. Make an initial guess. Guess any positive number x0.
  2. Improve the guess. Apply the formula x1 = (x0 + S / x0) / 2. The number x1 is a better approximation to sqrt(S).
  3. Iterate until convergence. Apply the formula xn+1 = (xn + S / xn) / 2 until the process converges. Convergence is achieved when the digits of xn+1 and xn agree to as many decimal places as you desire.

Let's use this algorithm to compute the square root of S = 20 to at least two decimal places.

  1. An initial guess is x0 = 10.
  2. Apply the formula: x1 = (10 + 20/10)/2 = 6. The number 6 is a better approximation to sqrt(20).
  3. Apply the formula again to obtain x2 = (6 + 20/6)/2 = 4.66667. The next iterations are x3 = 4.47619 and x4 = 4.47214.

Because x3 and x4 agree to two decimal places, the algorithm ends after four iterations. An estimate for sqrt(20) is 4.47214.

How do you choose an initial guess?

You can choose any positive value as an initial guess. However, when I was a kid I used to race my friends to see who could find the square root the fastest. I discovered that if you choose a good guess, then you have to compute only a few iterations. I invented a strategy for finding a good guess, which I call "The Rule of Twos and Sevens."

The Rule of Twos and Sevens chooses an initial guess from among the candidates {2, 7, 20, 70, 200, 700, ...}. The Rule use the fact that a square root has about half as many integer digits as the number itself. The Rule follows and is illustrated by using S = 3249.

  1. Count the integer digits in S. For example, S = 3249 has four digits.
  2. Consider the candidates that have half as many digits as S. (If S has an odd number of digits, round up the number of digits.) For example, 20 and 70 are two-digit candidates. Use those candidates when the number S has three or four integer digits.
  3. Use mental arithmetic to decide which candidate is better. For example, 20 is the square root of 400 whereas 70 is the square root of (about) 5000. Because S is closer to 5000, choose 70 as the starting guess. (Challenge: Convince yourself that you can use 20 for all three-digit integers, 200 for all five-digit integers, and so forth.)

In other words, a good guess starts with a 2 or a 7 and has about half as many digits as are in the whole-number part of S. My experience is that The Rule of Twos and Sevens usually converges to a solution (within two decimal places) in four or fewer iterations.

For small numbers, you can choose the initial guess more precisely. If S is less than 225 (=15x15), you can bracket the square root by using the perfect squares {1, 4, 9, ..., 196, 225} and then use the corresponding integer in the range [1, 15] as an initial guess.

Implement the Babylonian square-root algorithm in SAS

In tribute to all students who ever struggled to perform this algorithm by hand, I implemented the Babylonian square-root algorithm in SAS. You can implement the algorithm directly as a DATA step program, but I chose to use PROC FCMP to define new DATA step functions.

proc fcmp outlib=work.funcs.MathFuncs;
/* Rule of 2s and 7s: Count the number of digits in S.
   Choose initial guess to have half as many digits 
   (rounded up) and start with a 2 or 7. */
function BabylonianGuess(S);      /* provide initial guess */
   str = put(floor(S), 16.);      /* convert [S] to string */
   L = length(strip(str));        /* count how many digits */
   d = ceil(L/2);                 /* about half as many digits (round up) */
   guess2 = 2*10**(d-1);
   guess7 = 7*10**(d-1);
   if abs(guess2**2 - S) < abs(guess7**2 - S) then
      return( guess2 );
   else 
      return( guess7 );    
endsub;
 
/* the Babylonian method (aka, Hero's method) for finding square roots */
function BabylonianSqrt(S);
   epsilon = 100*constant("maceps"); /* convergence criterion */
   if S < 0 then x = .;              /* handle negative numbers */
   else if S=0 then x = 0;           /* handle zero */
   else do;
      x = BabylonianGuess(S);     /* initial guess */
      xPrev = 0;
      do while (abs(x - xPrev) > epsilon);
         xPrev = x;
         x = (xPrev + S/xPrev)/2; /* iterate to improve guess */
      end;
   end;
   return( x );
endsub;
quit;
 
/* Functions defined. Compare Babylonian algorithm to modern SQRT function */
options cmplib=work.funcs;        /* define location of functions */
data Compare;
   input S @@;
   BabySqrt = BabylonianSqrt(S);  /* Babylonian algorithm */
   Sqrt = sqrt(S);                /* modern computation */
   Diff = abs(BabySqrt - Sqrt);   /* compare values */
datalines;
-3 0 1 2 4 10 16 30 100 3249 125348
;
 
proc print label; label BabySqrt = "Babylonian Sqrt"; run;
Comparison of Babylonian square-root algorithm with modern SQRT function

Notice that the Babylonian method produces essentially the same numerical value as the built-in SQRT function in SAS. Of course, the iterative BabylonianSqrt function is not nearly as efficient as the built-in SQRT function.

For more information about the Babylonian algorithm and other algorithms for computing square roots, see Brown (1999) Coll. Math J.

Do you remember learning a square root algorithm in school? Do you still remember it? Are there other "ancient" algorithms that you have learned that are now unnecessary because of modern technology? Leave a comment.

Post a Comment

Ten tips before you run an optimization

Optimization is a primary tool of computational statistics. SAS/IML software provides a suite of nonlinear optimizers that makes it easy to find an optimum for a user-defined objective function. You can perform unconstrained optimization, or define linear or nonlinear constraints for constrained optimization.

Over the years I have seen many questions about optimization (which is also call nonlinear programming (NLP)) posted to discussion forums such as the SAS/IML Support Community. A typical question describes a program that is producing an error, will not converge, or is running exceedingly slow. Such problems can be exceedingly frustrating, and some programmers express their frustration by punctuating their questions with multiple exclamation points:

  • "Why is the Newton-Raphson (NLPNRA) routine broken?!!!"
  • "Why is SAS/IML taking forever to solve this problem????!!!!!"

To paraphrase Shakespeare, in most cases the fault lies not in our optimization routines, but in ourselves. The following checklist describes 10 common SAS/IML issues that lead to errors or lack of convergence in nonlinear optimization. If you check this list before attempting to optimize a function, then you increase the chances that the SAS/IML optimization routines will produce an optimal solution quickly and efficiently.

  1. Thoroughly test and debug the objective function before you optimize. This is the most common reason that people post questions to a discussion forum. The programmer reports that the optimization routine is giving an error message, but the real problem is that there is a logical error in the objective function. When the optimizer calls the objective function, BLAM!—the program stops with an error. The solution is to test the objective function before you call the optimizer. Make sure that the objective function produces correct results for a range of possible input values.

  2. Vectorize the objective function, when possible. The objective function will be called many times. Therefore make sure it is as efficient as possible. In a matrix-vector language such as SAS/IML, you should vectorize, which means replacing loops by using a vector computation.

  3. Provide a good initial guess. Unfortunately, many programmers don't put much thought into the initial guess. They use a vector of zeros or ones or assign some arbitrary values. The initial guess is important. Newton's method converges quadratically to a solution if the initial guess is close to the solution. A good guess might converge in a few iterations, whereas a bad guess might not converge at all or might require dozens of iterations. You can use graphical techniques to find an initial guess. If you are running a constrained optimization, ensure that the initial guess satisfies the constraints.

  4. Specify derivatives correctly. Some optimizers (such as Newton's method) use derivative information to find an optimum of a smooth objective function. If your optimization is running slowly or not converging, you might want to specify the analytical derivatives. I've written a previous article about how to specify derivatives and ensure that they are correct. If you do not write a derivative function, SAS/IML will use finite differences. Finite differences require evaluating the objective function many times, which can be slow. For objective functions that are extremely flat near the optimum, finite differences can be inaccurate or lead to convergence issues.

  5. Use global variables to specify constants. In the SAS/IML language, the objective function for the NLP routines accepts a single vector of parameters. The optimizer tries to adjust the values of those parameters in order to optimize the objective function, subject to any constraints. But sometimes the objective function requires additional (constant) parameters. For example, in maximum likelihood estimation, the objective function needs access to the data. When you define the objective function, use the GLOBAL statement to specify the data and other unchanging parameters.

  6. Check your data and your model. Sometimes there is nothing wrong with the objective function, it is the data that is causing the problem. In maximum likelihood estimation (MLE), the data is part of the optimization. Degenerate data, constant data, collinearities, and singular matrices can arise when you attempt to fit an inappropriate model to data. Lack of convergence could mean that the model does not fit the data. Furthermore, not every MLE problem has a solution! Use a combination of research and visualization to make sure that a solution exists and is not degenerate.

  7. During the debugging phase, print the iteration history. When you use the NLP routines in SAS/IML, you create a vector of options that controls the optimizer algorithm. The second element of this vector specifies the amount of output that you want to create. When I am debugging a problem, I usually specify a value of 4, which means that the optimizer prints the iteration history and other details. By examining the output, you can often discover whether the initial guess is converging to an optimal value, is bouncing around, or is diverging. If the optimization is not converging, examining the iteration history is often the first step to understanding why.

  8. Define constraints correctly. In many statistical problems, parameters are constrained. For example, in maximum likelihood estimation, scale parameters must be positive. The simplest linear-constraint matrix has two rows and a column for each parameter. The first row represents the lower bound for each parameter and the second row represents the upper bound. Put the value 0 in the first row for parameters that must be nonnegative, or use a small number such as 1E-6 if you need to bound the parameter away from zero. The SAS/IML documentation shows how you can add additional rows to specify additional constraints. For nonlinear constraints, you can use the NLC= option to specify a module that defines the feasible region.

  9. Trap out-of-domain conditions in the objective function. When you use the NLC= option to define nonlinear constraints, the final solution should lie in the feasible region. However, the optimizer is entitled to evaluate the objective function in the infeasible region. Consequently, if the objective function contains logs, square roots, or other functions that are undefined in some regions of parameter space, you should trap and handle bad parameter values within the objective function. If you are maximizing the objective function, return a large negative value (such as -1E6) when the input parameters are outside the domain of the function.

  10. Do not attempt 10,000 optimizations until you can successfully complete one optimization. Sometimes optimizations are part of a simulation study. Make sure that the optimization is robust and completes successfully for one simulated sample. Then try five samples. Then 100. Time the performance of your simulation on 100 samples and use that time to estimate the run time for 10,000 samples. If you encounter a simulated sample for which the optimization does not converge, save that sample so that you can carefully analyze it and discover why the optimization failed. Remember that you can check the return code from the NLP routines to determine if the optimization converged.

Optimization can be challenging, but you can increase the chance of success if you follow the tips and strategies in this checklist. These tips help you to avoid common pitfalls. By following these suggestions, your SAS/IML optimization will be more efficient, more accurate, and less frustrating.

Post a Comment

What is a DATA step view and why is it important?

Last week I analyzed 12 million records of taxi cab transactions in New York City. As part of that analysis, I used a DATA step view to create a new variable, which was the ratio of the tip amount to the fare amount.

A novice SAS programmer told me that he has never heard of a "DATA step view." He asked, "What is a DATA step view?"

Simply put, a "view" is a SAS DATA step program that is stored for later execution. When the SAS language processor encounters the RUN statement, the program is compiled and saved, but not executed. When a procedure uses a data view, the program runs and serves data to the procedure as if the procedure were reading a regular SAS data set. Thus you can use a view to manipulate data "on the fly."

I like to create views when I want to construct a new variable in a huge data set, but I don't want to physically copy the data. When I analyze the data view by using another procedure, the constructed variable is computed on the fly.

Here's an example. Suppose that you have a large data set that includes heights and weights for millions of patients. Some (but perhaps not all) analysts at your company need to analyze the body-mass index (BMI) for these patients. You have two options. The first option is to create a new data set that has the new column in it. This requires that you duplicate the original data and add a new column. The second option is to keep the original data unchanged, but create a view that computes the BMI. Any analyst that needs the BMI can access the data by using the view.

Let's see how this works by using a small data set. The Sashelp.Class data set contains height and weight (in English measurements) for 19 children. The BMI formula for children is slightly more complicated than for adults, but for simplicity the following SAS program simply uses the adult formula. The following DATA step creates a data view by using the VIEW= option on the DATA statement. The MEANS procedure then analyzes the newly calculated BMI variable:

data BMI / view=BMI;                 /* define DATA step view */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;   /* BMI formula for adults (pounds and inches) */
run;
 
proc means data=BMI;
   var BMI;
run;
dataview

As you can see, the syntax for the MEANS procedure does not change. In fact, the only syntax that changes is the VIEW= option in the DATA step.

When SAS encounters the RUN statement in the DATA step, it saves the program. The program is not executed until it is used in the DATA= option in PROC MEANS. At that point, SAS executes the program and computes the BMI variable, which the procedure consumes and analyzes.

There are three main advantage of a data view: reduced storage space, not cluttering up a data set with extra variables, and if the view uses data that is updated regularly (for example, nightly sales data) then the view always reads the current data.

The main disadvantage to DATA step views is that the computed columns must be recomputed every time that a procedure uses the view. If you specify DATA=BMI for additional procedures, the BMI variable is recomputed each time, which is slower than reading pre-computed data.

For more information about SAS data views, see

Do you use DATA step views in your company? Leave a comment.

Post a Comment

Create a package in SAS/IML

In a previous post I showed how to download, install, and use packages in SAS/IML 14.1. SAS/IML packages incorporate source files, documentation, data sets, and sample programs into a ZIP file. The PACKAGE statement enables you to install, uninstall, and manage packages. You can load functions and data into your current IML session and thereby leverage the work of the experts who wrote the package.

Do you have some awesome SAS/IML functions to share? Are you an expert in some subject area? This article describes how to create a SAS/IML package for others to use.

The following list shows the main steps to create a SAS/IML package. Each step is demonstrated by using the polygon package that I created for the paper "Writing Packages: A New Way to Distribute and Use SAS/IML Programs." You can download the polygon package and examine its files and structure.

1. Create a root-level directory

The first step is to think of a name for your SAS/IML package. The name must be a valid SAS name, which means it must contain 32 characters or less, begin with a letter or underscore, and contain only letters, underscores, and digits. Create a directory that has this name in all lowercase letters. For example, the root-level directory for the polygon package is named polygon.

2. Create the info.txt file in the root-level directory

The info.txt file is a manifest. It provides information about the names of the source files in the package. It consists of a series of keyword-value pairs. The SAS/IML User's Guide provides complete details about the form of the info.txt file, but an example file follows:

# SAS/IML Package Information File Format 1.0
Name:        polygon
Description: Computational geometry for polygons
Author:      Rick Wicklin <Rick.Wicklin@sas.com>
Version:     1.0
RequiresIML: 14.1
SourceFiles: PolyArea.iml
             PolyCentroid.iml
             PolyDraw.iml
             <...list additional files...>

The first line of the info.txt file specifies the file format. Use exactly the line shown here. The value of the Name keyword specifies the name of the package. When you create a ZIP file (Step 6), be sure to match the case exactly.

3. Create the source files

For most SAS/IML packages, the functionality is provided by a series of related functions. Put all source files in the source subdirectory.

A source file usually contains one or more module definitions. The source files cannot contain the PROC IML statement or the QUIT statement because those statement would cause an IML session to exit when the source file is read. Source files are read in the order in which they are specified in the info.txt file.

I like to put each major function in its own source file, although sometimes I include associated helper function in the same file. I begin function names with a short prefix that identifies the package to which the functions belong. For example, most functions in the polygon package begin with the prefix "POLY". Internal-only functions (that it, those that are not publically documented) in the package begin with the prefix "_POLY".

I like to use ".iml" as a file extension to remind me that these are snippets of IML programs. However, you can use ".sas" as an extension if you prefer.

4. Document the SAS/IML package

Even if your package is very useful, people won't want to use it if there isn't sufficient documentation about how to use it. You can provide two kinds of documentation. The main documentation is usually a PDF file in the help subdirectory. You might also include slideshows, Word documents, or any other useful files. The secondary documentation is a plain (ASCII) text file that has the same name as your package. For example, the polygon package contains the file polygon.txt in the help subdirectory. This file is echoed to the SAS log when a user installs the package and then submits the PACKAGE HELP polygon statement.

5. Create other files and subdirectories

Sometimes the best way to show someone how to use a package is to write a driver program that loads the package, calls the functions, and displays the results. Driver programs and other sample programs should be put in the programs subdirectory.

For the polygon package, I included the file Example.sas, which shows how to load the package and call each function in the package. I also included other programs that test the functions. For example, the drawing function (POLYDRAW) has many options, so I wanted to show how to use each option. I also show how to define and use non-simple polygons, which are polygons that have edge intersection.

Another sure-fire way to make sure that users know how to use your functions is to include sample data. If you have data sets to distribute, create a subdirectory named data. You can put SAS data sets, CSV files, and other sources of data into that directory.

If you have other files to distribute, create additional subdirectories. For example, you might have a C directory if you include C files or an R subdirectory if you include R files.

6. Create a ZIP file

At this point, the directories and files for the polygon package looks like the following:

C:%HOMEPATH%\My Documents\My SAS Files\polygon
|   info.txt
|
+---data
|   simple.sas7bdat, states48.sas7bdat
|
+---help
|   polygon.docx, polygon.pdf, polygon.txt
|
+---programs
|   Example.sas, TestNonSimplePoly.sas, TestPolyDraw.sas, TestPolyPtInside.sas
|
\---source
    PolyArea.iml, PolyBoundingBox.iml, PolyCentroid.iml, ..., PolyRegular.iml

The last step is to run a compression utility to create a ZIP file that contains the root-level directory and all subdirectories. The name of the ZIP file should match the package name, including the case. For example, the ZIP file for the polygon package is named polygon.zip. The following image shows creating a ZIP file by using the WinZip utility. Other popular (and free) alternatives are 7-Zip, PeaZip, and PKZip.

Creating a ZIP file for a SAS/IML Package

7. Test and upload your package

Your SAS/IML package is now ready to be tested. Make sure that you can successfully use the PACKAGE INSTALL, PACKAGE LOAD, and PACKAGE HELP statements on your package. If you are distributing data sets, test the PACKAGE LIBNAME statement.

When you are satisfied that the package installs and loads correctly, you can share the package with others. To share your work in-house, ask your system administrator to load the package into the PUBLIC collection so that everyone in your workgroup can use it. If you want to make the package available to others outside of your company, upload the package the SAS/IML File Exchange. You can follow the directions in the article "How to contribute to the SAS/IML File Exchange."

Summary

Creating a package is a convenient way to share your work with others in your company or around the world. It enables other SAS/IML programmers to leverage your expertise and programming skills. And who knows, it just might make you famous!

For complete details about how to create a package, see the "Packages" chapter of the SAS/IML User's Guide.

Post a Comment

How much do New Yorkers tip taxi drivers?

New York taxis: tip ersus fare amount

When I read Robert Allison's article about the cost of a taxi ride in New York City, I was struck by the scatter plot (shown at right; click to enlarge) that plots the tip amount against the total bill for 12 million taxi rides. The graph clearly reveals diagonal and horizontal trends in the data. The apparent "lines" in the scatter plot correspond to three kinds of riders:

  • The horizontal lines correspond to riders who tip a set amount regardless of the fare amount. There are strong visible trends for $5, $10, $15, and $20. There are weaker trends at other whole-dollar amounts.
  • The diagonal lines that pass through the origin are riders whose tip is a percentage of the fare. There are strong visible trends at 15%, 20%, 25%, and even 30%.
  • The diagonal lines that slope down from the upper left are riders who give the driver a set amount and tell him to keep the change. There are clear lines in the full-sized graph for riders whose total bill (fare plus tip) was $20, $40, $50, $60, $70, and $100.

The graph made me wonder how much a typical New Yorker tips as a percentage of the bill.

Tips and percentages

You can download the data from Robert Allison's post. This article analyzes the same 12 million records (from January, 2015) that Robert featured.

The first step is simply to use a SAS DATA step to compute the percentage of the tip. The following program uses a data VIEW to save disk space. You can then use PROC MEANS in Base SAS to compute percentiles for the tip amounts:

libname taxi "<path to data>";
 
data tax / view=tax;
   label tip_pct = "Tip Percent";
   set taxi.ny_taxi_data_2015_01;
   where 0 < fare_amount <= 100 & 0 <= tip_amount <= 100;
   total_bill = total_amount - tip_amount ;
   tip_pct = tip_amount  / total_bill;
   keep fare_amount tip_amount tip_pct total_bill;
run;
 
title "Tip as Percentage of Total Fare";
proc means data=tax P40 P50 P75 P90 P95 P99;
   var tip_pct;
run;
taxitips1

The output shows some surprising facts. According to these data on NYC Yellow Taxis:

  • More than 40% of all riders do not tip.
  • The median tip is 13.3% of the fare.
  • Twenty-five percent of all riders tip 20% or more.
  • Five percent of all riders tip 25% or more.

Edit: (03MAY2016) A reader notes in the comments that 40% of riders not tipping seems excessively high. He suggests that the data could be biased by a systematic underreporting of tips and suggests we need a better understanding of data collection process. I went to the "data dictionary" for this data and discovered that the tip_amount value "is automatically populated for credit card tips. Cash tips are not included.” Ah-hah! The reader is correct! There is a systematic bias in the data.

The distribution of tip percentages

You can use a histogram to show the distribution of tip percentages. The distribution has a long tail, so the following call to PROC UNIVARIATE truncates the distribution at 100%.

proc univariate data=tax(obs=1000000);
   format tip_pct PERCENT7.0;
   where tip_pct <= 1;
   var tip_pct;
   histogram tip_pct / endpoints=(0 to 1 by 0.05) statref=Q1 Median Q3
             odstitle="Percent of Fare Left as Tip for NYC Yellow Taxis";
run;
Distribution of Taxi Tips as Percentage of Fare

The histogram appears to be a mixture distribution. The first bar (about 41%) represents the riders who leave either no tip or a negligible tip. [Edit: Many of these are cash transactions for which the tip is not reported.] Then there is a "hump," which appears to be normally distributed, that represents riders who leave a tip that is between 5% and 40% of the fare. As you might expect, the mode of the hump is between 15% and 20%, which is the cultural norm in the US. The last part of the distribution is the long tail, which represents generous individuals whose tip is a substantial percentage of the fare.

The vertical lines show the 25th percentile (at 0%), the 50th percentile (at 13%), and the 75th percentile (at 20%).

Examining the central part of the distribution

The distribution of tips is dominated by the large proportion of taxi riders who do not tip. (Or, to be more cynical, the proportion of rides for which the reported tip is zero.) This section excludes the non-tippers and analyzes only those riders who leave a non-zero tip.

Let's repeat the analyses of the previous sections, but this time exclude the non-tippers. First, look at the quantiles. Notice the WHERE clause to exclude the non-tippers.

title "Tip Percentages Conditional on Tipping";
proc means data=tax Q1 Median Q3 P90 P95 P99;
   where tip_amount > 0;   /* exclude non-tippers */
   var tip_pct;
run;
taxitips3

Ah! That's more like what I expected to see! For riders who leave a tip, the median and 75th percentiles are about 20%. About one in ten New Yorkers that leave a tip leave 25% or more.

Visualizing the distribution of tipping percentages among those who tip

For a final graph, let's visualize the distribution of non-zero tip amounts. I could draw the same histogram as before, but for fun I will use the three-panel visualization in SAS that I created a few years ago. In the three-panel visualization, a histogram is stacked in a column with a box plot and a normal quantile-quantile plot (Q-Q plot). Each panel reveals slightly different aspects of the distribution. You can download the %ThreePanel macro from my previous article. The following graph shows the three-panel visualization (click to enlarge) of one million tippers:

%ThreePanel(tax(obs=1000000 where=(tip_amount > 0 && tip_pct <= 1)), tip_pct);
Three-panel distribution of taxi tips

As compared to the previous histogram, this histogram uses a smaller bin width and therefore shows small-scale structure that was not previously apparent. The bumps in the histogram and kernel density estimate show that many riders tip 20%, 25%, and 30%. The box plot shows a relatively narrow interquartile range (the box) and a many riders whose tip percentages are well-above or well-below the median (the outliers).

Lastly, consider the Q-Q plot. If the points fall near the diagonal line, then the distribution is approximately normal. That is not the case here. The middle of the distribution falls somewhat near the diagonal line, but the tails do not. By the usual interpretation of a Q-Q plot, the lower- and upper tails of the distribution are decidedly nonnormal. The middle of the distribution is approximately normal, but the staircase structure of the Q-Q plot means that certain values are repeated many times (20%, 25%, and 30%).

I conjecture that the lower tail is from the "keep the change" riders who rounded the fare to a whole-dollar amount that was few pennies more than the fare. The upper tail is likely from generous riders who substantially rounded up the fare, perhaps paying $20 bill to cover a $13 fare.

In summary, a short analysis of 12 million New York taxi rides indicates that slightly more than 40% of riders do not leave any tip. [EDIT: This large percentage appears to be from cash transactions for which the tip amount is not collected.] A small percentage leave a tiny tip, but the bulk of the tippers leave tips in the 16% to 20% range. About 10% of riders leave as much as 25% or 30% for a tip, and a small number of riders leave tips that correspond to larger percentages.

Post a Comment

Packages: A new way to share SAS/IML programs

My previous post highlighted presentations at SAS Global Forum 2016 that heavily used SAS/IML software. Several of the authors clearly want to share their work with the wider SAS analytical community. They include their SAS/IML program in an appendix or mention a web site or email address from which the reader can obtain the SAS/IML functions that implement the analysis.

As of SAS/IML 14.1, there is an easy way to share SAS/IML functions with others. The IML language now supports packages through the PACKAGE statement. A package is a collection of files that contains source code, documentation, example programs, and sample data. Packages are distributed as ZIP files.

This article shows how to install and use a package. A subsequent article will discuss how to author a package. For more information, see my 2016 SAS Global Forum paper, "Writing Packages: A New Way to Distribute and Use SAS/IML Programs".

The main steps to use a package are as follows:

  1. Download the ZIP file that contains a package.
  2. To install the package, use the PACKAGE INSTALL statement from within PROC IML or SAS/IML Studio.
  3. To learn how to use the package, read the package documentation and run some of the sample programs that the package provides.
  4. To use the functions in the package, use the PACKAGE LOAD statement to load the functions into your current SAS/IML session.
  5. To access sample data that the package provides, use the PACKAGE LIBNAME statement.

In the following sections, each step is illustrated by using the polygon package, which contains functions that compute geometric properties of planar polygons. You can download the package from the SAS/IML File Exchange.

Obtain the package

The SAS/IML File Exchange is the recomended location to obtain packages. However, some authors might choose to post packages on GitHub or on their own web site.

Download the polygon package and save it to a location that SAS can access. This article assumes that the ZIP file is saved to the location C:\Packages\polygon.zip on a Windows PC.

Install the package

To install the package, run the following statements in SAS/IML 14.1 or later:

proc iml;
package install "C:\Packages\polygon.zip"; 
quit;

You only need to install a package one time. You can then call it in future PROC IML sessions. The package remains installed even after you exit from SAS and reboot your computer.

Read the documentation

The help directory in the ZIP file contains the full documentation for the package, often in the form of a PDF file. You can read that documentation by using a PDF reader outside of SAS. The PDF documentation might contain diagrams, equations, graphs, and other non-text information.

Authors should also provide a brief text-only version of the syntax. While running SAS, you can display the text-only version in the SAS log, as follows:

proc iml;
package help polygon;

For the polygon package, the SAS log contains the following overview of the package and the syntax for every public module.

Polygon Package
Description: Computational geometry for polygons
 
A polygon is represented as an n x 2 matrix, where the rows represent
adjacent vertices for the polygon. The polygon should be "open,"
which means that the last row does not equal the first row.
<...More text...>
 
Module Syntax:
PolyArea(P);
   returns the areas of simple polygons.
 
<...documentation for nine other functions...>
 
PolyWindingNumber(P, R);
   returns the winding number of the polygon R around the point R.

The purpose of the PACKAGE HELP syntax is to remind the user of the syntax of the functions in the package while inside of SAS.

Another way to learn to use a pakage is to examine and run sample programs that the package author provides. For the polygon package, you can look in the programs directory. The file Example.sas demonstrates calling functions in the package.

Load the functions

To read the functions in the package into the current IML session, use the PACKAGE LOAD statement, as follows:

package load polygon;

The SAS log will display a NOTE for each modules that is loaded:

NOTE: Module _POLYAREA defined.
NOTE: Module POLYAREA defined.
<...More text...>
NOTE: Module POLYSTACK defined.

Notice that all function begin with a common prefix, in this case "POLY" (or "_POLY," for internal-only functions). This is a good programming practice because it reduces the likelihood that functions in one package conflict with functions in another package. For example, if you load two packages that each have a function named "COMPUTE" (or an equally generic name), then there will be a comflict. By using "POLY" as a prefix, there is less likely to be a name collision.

Call the functions

After the functions are loaded, you can call them from your program. For example, the following statements define the four vertices of a rectangle and then call functions from the package to compute the perimeter, area, and whether the polygon is a convex region.

P = {0 0, 1 0, 1 2, 0 2};         /* vertices of rectangle */
Perimeter = PolyPerimeter(P);
Area      = PolyArea(P);
IsConvex  = PolyIsConvex(P);
print Perimeter Area IsConvex;
package1

Use sample data

Some packages include sample data sets. You can use the PACKAGE LIBNAME statement to create a SAS libref that points to the data subdirectory of an installed package. The following statements read in sample data for polygons that are defined in the Simple data set. After the vertices of the polygons are read, the PolyDraw subroutine is called to visualize the polygons in the data set.

package libname PolyData polygon;    /* define libref */
use PolyData.Simple;                 /* sample data in package */
read all var {u v ID} into P;        /* vertices; 3rd col is ID variable */
close PolyData.Simple;
run PolyDraw(P);                     /* visualize the polygons */
package2

Display package information

You can use the PACKAGE LIST statement to print the name and version of all installed packages. You can use the PACKAGE INFO command to obtain details about a specific package. The output for these statements is not shown.

package list;                /* list all installed packages */ 
package info polygon;        /* details about polygon package */

Summary

This article briefly describes how to install and use a SAS/IML package. Packages were introduced in SA/IML 14.1, which was released with SAS 9.4m3 in July 2015. This article shows how to download, install, load, and use packages that are written by others.

For additional details about packages, see Wicklin (2016). The official documentation of packages is contained in the "Packages" chapter in the SAS/IML User's Guide.

Post a Comment

Matrix computations at SAS Global Forum 2016

Last week I attended SAS Global Forum 2016 in Las Vegas. I and more than 5,000 other attendees discussed and shared tips about data analysis and statistics.

Naturally, I attended many presentations that featured using SAS/IML software to implement advanced analytical algorithms. Several speakers showed impressive mastery of SAS/IML programming techniques. In all, almost 30 papers stated that they used SAS/IML for some or all of the computations.

The following presentations were especially impressive. As you can see, topics include Bayesian analysis, signal processing, spatial analysis, simulation, and logistic regression.

  • Jason Bentley, a PhD student at The University of Sydney, presented a paper about Bayesian inference for complex hierarchical models with smoothing splines (joint work with Cathy Lee). Whereas Markov-chain Monte Carlo techniques are "computationally intensive, or even infeasible" for large hierarchical models, Bentley coded a "fast deterministic alternative to MCMC." His SAS/IML code runs in 22 seconds on a hierarchical model that includes 350 schools and approximately 88,000 students. Jason says that this was his first serious SAS/IML program, but that he found the language "ideal" for implementing the algorithm.
  • Woranat Wongdhamma at Oklahoma State University used wavelet analysis in SAS/IML to construct a predictive model that helps clinical researchers diagnose obstructive sleep apnea. I did not realize that 1 in 15 adult Americans have moderate or severe sleep apnea, and many of these people are undiagnosed. Hopefully Wongdhamma's analysis will help expedite the diagnosis process.
  • dasilva Alan Ricardo da Silva, a professor at the University of Brasilia, has written many papers that feature SAS/IML programs. This year (with T. Rodrigues), Alan showed how to implement an algorithm for geographically weighted negative binomial regression. He applied this spatial analysis method to freight transportation in Brazil. The image at right is taken from his paper.
  • A second paper by da Silva (with co-author Paulo da Silva) shows how to use SAS/IML to simulate data from a skew normal and a skew t distribution.
  • Gongwei Chen used SAS/IML to build a discrete-time Markov chain model to forecast workloads at the Washington State Department of Corrections. Social workers and payroll officers supervise convicted offenders who have been released from prison or were sentenced to supervision by the court system. Individuals who exhibit good behavior can transition from a highly supervised situation into less supervision. Other individuals might commit a new offense that requires an increase in supervision. Chen's probabilistic model helps the State of Washington forecast the resources needed for supervising offenders.
  • Katherine Cai and Jeffrey Wilson at Arizona State University used SAS/IML to build a logistic model of longitudinal data with time-dependent covariates. The algorithm was applied to obesity data and to children’s health data in the Philippines.
  • Another paper by Wilson (co-authored with Kyle Irimata) used SAS/IML to implement exact logistic regression for nested binary data.

The SAS/IML language was used in many other e-posters and presentations that unfortunately I could not attend due to other commitments. All of the papers for the proceedings are available online, so you can search the proceedings for "IML" or any other topic interests you.

Experienced SAS programmers recognize that when an analysis is not available in any SAS procedure, the SAS/IML language is often the best choice for implementing the analysis. These papers and my discussions with SAS customers indicate that the SAS/IML language is actively used for advanced analytical models that require simulation, optimization, and matrix computations. Because SAS/IML is part of the free SAS University Edition, which has been downloaded 500,000 times, I expect the number of SAS/IML users to continue to grow.

Post a Comment