The Babylonian method for finding square roots by hand

When I was in the sixth grade, I learned an iterative procedure for computing square roots by hand. Yes, I said by hand. Scientific calculators with a square root key were not yet widely available, so I and previous generations of children suffered through learning to calculate square roots by hand.

I still remember being amazed when I first saw the iterative square root algorithm. It was the first time that I thought math is magical. The algorithm required that I make an initial guess for the square root. I then applied a "magic formula" a few times. The magic formula improved my guess and estimated the square root that I sought.

The Babylonian square-root algorithm

The iterative method is called the Babylonian method for finding square roots, or sometimes Hero's method. It was known to the ancient Babylonians (1500 BC) and Greeks (100 AD) long before Newton invented his general procedure.

Here's how it works. Suppose you are given any positive number S. To find the square root of S, do the following:

  1. Make an initial guess. Guess any positive number x0.
  2. Improve the guess. Apply the formula x1 = (x0 + S / x0) / 2. The number x1 is a better approximation to sqrt(S).
  3. Iterate until convergence. Apply the formula xn+1 = (xn + S / xn) / 2 until the process converges. Convergence is achieved when the digits of xn+1 and xn agree to as many decimal places as you desire.

Let's use this algorithm to compute the square root of S = 20 to at least two decimal places.

  1. An initial guess is x0 = 10.
  2. Apply the formula: x1 = (10 + 20/10)/2 = 6. The number 6 is a better approximation to sqrt(20).
  3. Apply the formula again to obtain x2 = (6 + 20/6)/2 = 4.66667. The next iterations are x3 = 4.47619 and x4 = 4.47214.

Because x3 and x4 agree to two decimal places, the algorithm ends after four iterations. An estimate for sqrt(20) is 4.47214.

How do you choose an initial guess?

You can choose any positive value as an initial guess. However, when I was a kid I used to race my friends to see who could find the square root the fastest. I discovered that if you choose a good guess, then you have to compute only a few iterations. I invented a strategy for finding a good guess, which I call "The Rule of Twos and Sevens."

The Rule of Twos and Sevens chooses an initial guess from among the candidates {2, 7, 20, 70, 200, 700, ...}. The Rule use the fact that a square root has about half as many integer digits as the number itself. The Rule follows and is illustrated by using S = 3249.

  1. Count the integer digits in S. For example, S = 3249 has four digits.
  2. Consider the candidates that have half as many digits as S. (If S has an odd number of digits, round up the number of digits.) For example, 20 and 70 are two-digit candidates. Use those candidates when the number S has three or four integer digits.
  3. Use mental arithmetic to decide which candidate is better. For example, 20 is the square root of 400 whereas 70 is the square root of (about) 5000. Because S is closer to 5000, choose 70 as the starting guess. (Challenge: Convince yourself that you can use 20 for all three-digit integers, 200 for all five-digit integers, and so forth.)

In other words, a good guess starts with a 2 or a 7 and has about half as many digits as are in the whole-number part of S. My experience is that The Rule of Twos and Sevens usually converges to a solution (within two decimal places) in four or fewer iterations.

For small numbers, you can choose the initial guess more precisely. If S is less than 225 (=15x15), you can bracket the square root by using the perfect squares {1, 4, 9, ..., 196, 225} and then use the corresponding integer in the range [1, 15] as an initial guess.

Implement the Babylonian square-root algorithm in SAS

In tribute to all students who ever struggled to perform this algorithm by hand, I implemented the Babylonian square-root algorithm in SAS. You can implement the algorithm directly as a DATA step program, but I chose to use PROC FCMP to define new DATA step functions.

proc fcmp outlib=work.funcs.MathFuncs;
/* Rule of 2s and 7s: Count the number of digits in S.
   Choose initial guess to have half as many digits 
   (rounded up) and start with a 2 or 7. */
function BabylonianGuess(S);      /* provide initial guess */
   str = put(floor(S), 16.);      /* convert [S] to string */
   L = length(strip(str));        /* count how many digits */
   d = ceil(L/2);                 /* about half as many digits (round up) */
   guess2 = 2*10**(d-1);
   guess7 = 7*10**(d-1);
   if abs(guess2**2 - S) < abs(guess7**2 - S) then
      return( guess2 );
   else 
      return( guess7 );    
endsub;
 
/* the Babylonian method (aka, Hero's method) for finding square roots */
function BabylonianSqrt(S);
   epsilon = 100*constant("maceps"); /* convergence criterion */
   if S < 0 then x = .;              /* handle negative numbers */
   else if S=0 then x = 0;           /* handle zero */
   else do;
      x = BabylonianGuess(S);     /* initial guess */
      xPrev = 0;
      do while (abs(x - xPrev) > epsilon);
         xPrev = x;
         x = (xPrev + S/xPrev)/2; /* iterate to improve guess */
      end;
   end;
   return( x );
endsub;
quit;
 
/* Functions defined. Compare Babylonian algorithm to modern SQRT function */
options cmplib=work.funcs;        /* define location of functions */
data Compare;
   input S @@;
   BabySqrt = BabylonianSqrt(S);  /* Babylonian algorithm */
   Sqrt = sqrt(S);                /* modern computation */
   Diff = abs(BabySqrt - Sqrt);   /* compare values */
datalines;
-3 0 1 2 4 10 16 30 100 3249 125348
;
 
proc print label; label BabySqrt = "Babylonian Sqrt"; run;
Comparison of Babylonian square-root algorithm with modern SQRT function

Notice that the Babylonian method produces essentially the same numerical value as the built-in SQRT function in SAS. Of course, the iterative BabylonianSqrt function is not nearly as efficient as the built-in SQRT function.

For more information about the Babylonian algorithm and other algorithms for computing square roots, see Brown (1999) Coll. Math J.

Do you remember learning a square root algorithm in school? Do you still remember it? Are there other "ancient" algorithms that you have learned that are now unnecessary because of modern technology? Leave a comment.

Post a Comment

Ten tips before you run an optimization

Optimization is a primary tool of computational statistics. SAS/IML software provides a suite of nonlinear optimizers that makes it easy to find an optimum for a user-defined objective function. You can perform unconstrained optimization, or define linear or nonlinear constraints for constrained optimization.

Over the years I have seen many questions about optimization (which is also call nonlinear programming (NLP)) posted to discussion forums such as the SAS/IML Support Community. A typical question describes a program that is producing an error, will not converge, or is running exceedingly slow. Such problems can be exceedingly frustrating, and some programmers express their frustration by punctuating their questions with multiple exclamation points:

  • "Why is the Newton-Raphson (NLPNRA) routine broken?!!!"
  • "Why is SAS/IML taking forever to solve this problem????!!!!!"

To paraphrase Shakespeare, in most cases the fault lies not in our optimization routines, but in ourselves. The following checklist describes 10 common SAS/IML issues that lead to errors or lack of convergence in nonlinear optimization. If you check this list before attempting to optimize a function, then you increase the chances that the SAS/IML optimization routines will produce an optimal solution quickly and efficiently.

  1. Thoroughly test and debug the objective function before you optimize. This is the most common reason that people post questions to a discussion forum. The programmer reports that the optimization routine is giving an error message, but the real problem is that there is a logical error in the objective function. When the optimizer calls the objective function, BLAM!—the program stops with an error. The solution is to test the objective function before you call the optimizer. Make sure that the objective function produces correct results for a range of possible input values.

  2. Vectorize the objective function, when possible. The objective function will be called many times. Therefore make sure it is as efficient as possible. In a matrix-vector language such as SAS/IML, you should vectorize, which means replacing loops by using a vector computation.

  3. Provide a good initial guess. Unfortunately, many programmers don't put much thought into the initial guess. They use a vector of zeros or ones or assign some arbitrary values. The initial guess is important. Newton's method converges quadratically to a solution if the initial guess is close to the solution. A good guess might converge in a few iterations, whereas a bad guess might not converge at all or might require dozens of iterations. You can use graphical techniques to find an initial guess. If you are running a constrained optimization, ensure that the initial guess satisfies the constraints.

  4. Specify derivatives correctly. Some optimizers (such as Newton's method) use derivative information to find an optimum of a smooth objective function. If your optimization is running slowly or not converging, you might want to specify the analytical derivatives. I've written a previous article about how to specify derivatives and ensure that they are correct. If you do not write a derivative function, SAS/IML will use finite differences. Finite differences require evaluating the objective function many times, which can be slow. For objective functions that are extremely flat near the optimum, finite differences can be inaccurate or lead to convergence issues.

  5. Use global variables to specify constants. In the SAS/IML language, the objective function for the NLP routines accepts a single vector of parameters. The optimizer tries to adjust the values of those parameters in order to optimize the objective function, subject to any constraints. But sometimes the objective function requires additional (constant) parameters. For example, in maximum likelihood estimation, the objective function needs access to the data. When you define the objective function, use the GLOBAL statement to specify the data and other unchanging parameters.

  6. Check your data and your model. Sometimes there is nothing wrong with the objective function, it is the data that is causing the problem. In maximum likelihood estimation (MLE), the data is part of the optimization. Degenerate data, constant data, collinearities, and singular matrices can arise when you attempt to fit an inappropriate model to data. Lack of convergence could mean that the model does not fit the data. Furthermore, not every MLE problem has a solution! Use a combination of research and visualization to make sure that a solution exists and is not degenerate.

  7. During the debugging phase, print the iteration history. When you use the NLP routines in SAS/IML, you create a vector of options that controls the optimizer algorithm. The second element of this vector specifies the amount of output that you want to create. When I am debugging a problem, I usually specify a value of 4, which means that the optimizer prints the iteration history and other details. By examining the output, you can often discover whether the initial guess is converging to an optimal value, is bouncing around, or is diverging. If the optimization is not converging, examining the iteration history is often the first step to understanding why.

  8. Define constraints correctly. In many statistical problems, parameters are constrained. For example, in maximum likelihood estimation, scale parameters must be positive. The simplest linear-constraint matrix has two rows and a column for each parameter. The first row represents the lower bound for each parameter and the second row represents the upper bound. Put the value 0 in the first row for parameters that must be nonnegative, or use a small number such as 1E-6 if you need to bound the parameter away from zero. The SAS/IML documentation shows how you can add additional rows to specify additional constraints. For nonlinear constraints, you can use the NLC= option to specify a module that defines the feasible region.

  9. Trap out-of-domain conditions in the objective function. When you use the NLC= option to define nonlinear constraints, the final solution should lie in the feasible region. However, the optimizer is entitled to evaluate the objective function in the infeasible region. Consequently, if the objective function contains logs, square roots, or other functions that are undefined in some regions of parameter space, you should trap and handle bad parameter values within the objective function. If you are maximizing the objective function, return a large negative value (such as -1E6) when the input parameters are outside the domain of the function.

  10. Do not attempt 10,000 optimizations until you can successfully complete one optimization. Sometimes optimizations are part of a simulation study. Make sure that the optimization is robust and completes successfully for one simulated sample. Then try five samples. Then 100. Time the performance of your simulation on 100 samples and use that time to estimate the run time for 10,000 samples. If you encounter a simulated sample for which the optimization does not converge, save that sample so that you can carefully analyze it and discover why the optimization failed. Remember that you can check the return code from the NLP routines to determine if the optimization converged.

Optimization can be challenging, but you can increase the chance of success if you follow the tips and strategies in this checklist. These tips help you to avoid common pitfalls. By following these suggestions, your SAS/IML optimization will be more efficient, more accurate, and less frustrating.

Post a Comment

What is a DATA step view and why is it important?

Last week I analyzed 12 million records of taxi cab transactions in New York City. As part of that analysis, I used a DATA step view to create a new variable, which was the ratio of the tip amount to the fare amount.

A novice SAS programmer told me that he has never heard of a "DATA step view." He asked, "What is a DATA step view?"

Simply put, a "view" is a SAS DATA step program that is stored for later execution. When the SAS language processor encounters the RUN statement, the program is compiled and saved, but not executed. When a procedure uses a data view, the program runs and serves data to the procedure as if the procedure were reading a regular SAS data set. Thus you can use a view to manipulate data "on the fly."

I like to create views when I want to construct a new variable in a huge data set, but I don't want to physically copy the data. When I analyze the data view by using another procedure, the constructed variable is computed on the fly.

Here's an example. Suppose you have a large data set that includes heights and weights for millions of patients. Some (but perhaps not all) analysts at your company need to analyze the body-mass index (BMI) for these patients. You have two options. The first option is to create a new data set that has the new column in it. This requires that you duplicate the original data and add a new column. The second option is to keep the original data unchanged, but create a view that computes the BMI. Any analyst that needs the BMI can access the data by using the view.

Let's see how this works by using a small data set. The Sashelp.Class data set contains height and weight (in English measurements) for 19 children. The BMI formula for children is slightly more complicated than for adults, but for simplicity the following SAS program simply uses the adult formula. The following DATA step creates a data view by using the VIEW= option on the DATA statement. The MEANS procedure then analyzes the newly calculated BMI variable:

data BMI / view=BMI;                 /* define DATA step view */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;   /* BMI formula for adults (pounds and inches) */
run;
 
proc means data=BMI;
   var BMI;
run;
dataview

As you can see, the syntax for the MEANS procedure does not change. In fact, the only syntax that changes is the VIEW= option in the DATA step.

When SAS encounters the RUN statement in the DATA step, it saves the program. The program is not executed until it is used in the DATA= option in PROC MEANS. At that point, SAS executes the program and computes the BMI variable, which the procedure consumes and analyzes.

There are three main advantage of a data view: reduced storage space, not cluttering up a data set with extra variables, and if the view uses data that is updated regularly (for example, nightly sales data) then the view always reads the current data.

The main disadvantage to DATA step views is that the computed columns must be recomputed every time that a procedure uses the view. If you specify DATA=BMI for additional procedures, the BMI variable is recomputed each time, which is slower than reading pre-computed data.

For more information about SAS data views, see

Do you use DATA step views in your company? Leave a comment.

Post a Comment

Create a package in SAS/IML

In a previous post I showed how to download, install, and use packages in SAS/IML 14.1. SAS/IML packages incorporate source files, documentation, data sets, and sample programs into a ZIP file. The PACKAGE statement enables you to install, uninstall, and manage packages. You can load functions and data into your current IML session and thereby leverage the work of the experts who wrote the package.

Do you have some awesome SAS/IML functions to share? Are you an expert in some subject area? This article describes how to create a SAS/IML package for others to use.

The following list shows the main steps to create a SAS/IML package. Each step is demonstrated by using the polygon package that I created for the paper "Writing Packages: A New Way to Distribute and Use SAS/IML Programs." You can download the polygon package and examine its files and structure. The polygon package implements functions for computing features of planar polygons, such as computing a centroid.

1. Create a root-level directory

The first step is to think of a name for your SAS/IML package. The name must be a valid SAS name, which means it must contain 32 characters or less, begin with a letter or underscore, and contain only letters, underscores, and digits. Create a directory that has this name in all lowercase letters. For example, the root-level directory for the polygon package is named polygon.

2. Create the info.txt file in the root-level directory

The info.txt file is a manifest. It provides information about the names of the source files in the package. It consists of a series of keyword-value pairs. The SAS/IML User's Guide provides complete details about the form of the info.txt file, but an example file follows:

# SAS/IML Package Information File Format 1.0
Name:        polygon
Description: Computational geometry for polygons
Author:      Rick Wicklin <Rick.Wicklin@sas.com>
Version:     1.0
RequiresIML: 14.1
SourceFiles: PolyArea.iml
             PolyCentroid.iml
             PolyDraw.iml
             <...list additional files...>

The first line of the info.txt file specifies the file format. Use exactly the line shown here. The value of the Name keyword specifies the name of the package. When you create a ZIP file (Step 6), be sure to match the case exactly.

3. Create the source files

For most SAS/IML packages, the functionality is provided by a series of related functions. Put all source files in the source subdirectory.

A source file usually contains one or more module definitions. The source files cannot contain the PROC IML statement or the QUIT statement because those statement would cause an IML session to exit when the source file is read. Source files are read in the order in which they are specified in the info.txt file.

I like to put each major function in its own source file, although sometimes I include associated helper function in the same file. I begin function names with a short prefix that identifies the package to which the functions belong. For example, most functions in the polygon package begin with the prefix "POLY". Internal-only functions (that it, those that are not publically documented) in the package begin with the prefix "_POLY".

I like to use ".iml" as a file extension to remind me that these are snippets of IML programs. However, you can use ".sas" as an extension if you prefer.

4. Document the SAS/IML package

Even if your package is very useful, people won't want to use it if there isn't sufficient documentation about how to use it. You can provide two kinds of documentation. The main documentation is usually a PDF file in the help subdirectory. You might also include slideshows, Word documents, or any other useful files. The secondary documentation is a plain (ASCII) text file that has the same name as your package. For example, the polygon package contains the file polygon.txt in the help subdirectory. This file is echoed to the SAS log when a user installs the package and then submits the PACKAGE HELP polygon statement.

5. Create other files and subdirectories

Sometimes the best way to show someone how to use a package is to write a driver program that loads the package, calls the functions, and displays the results. Driver programs and other sample programs should be put in the programs subdirectory.

For the polygon package, I included the file Example.sas, which shows how to load the package and call each function in the package. I also included other programs that test the functions. For example, the drawing function (POLYDRAW) has many options, so I wanted to show how to use each option. I also show how to define and use non-simple polygons, which are polygons that have edge intersection.

Another sure-fire way to make sure that users know how to use your functions is to include sample data. If you have data sets to distribute, create a subdirectory named data. You can put SAS data sets, CSV files, and other sources of data into that directory.

If you have other files to distribute, create additional subdirectories. For example, you might have a C directory if you include C files or an R subdirectory if you include R files.

6. Create a ZIP file

At this point, the directories and files for the polygon package looks like the following:

C:%HOMEPATH%\My Documents\My SAS Files\polygon
|   info.txt
|
+---data
|   simple.sas7bdat, states48.sas7bdat
|
+---help
|   polygon.docx, polygon.pdf, polygon.txt
|
+---programs
|   Example.sas, TestNonSimplePoly.sas, TestPolyDraw.sas, TestPolyPtInside.sas
|
\---source
    PolyArea.iml, PolyBoundingBox.iml, PolyCentroid.iml, ..., PolyRegular.iml

The last step is to run a compression utility to create a ZIP file that contains the root-level directory and all subdirectories. The name of the ZIP file should match the package name, including the case. For example, the ZIP file for the polygon package is named polygon.zip. The following image shows creating a ZIP file by using the WinZip utility. Other popular (and free) alternatives are 7-Zip, PeaZip, and PKZip.

Creating a ZIP file for a SAS/IML Package

7. Test and upload your package

Your SAS/IML package is now ready to be tested. Make sure that you can successfully use the PACKAGE INSTALL, PACKAGE LOAD, and PACKAGE HELP statements on your package. If you are distributing data sets, test the PACKAGE LIBNAME statement.

When you are satisfied that the package installs and loads correctly, you can share the package with others. To share your work in-house, ask your system administrator to load the package into the PUBLIC collection so that everyone in your workgroup can use it. If you want to make the package available to others outside of your company, upload the package the SAS/IML File Exchange. You can follow the directions in the article "How to contribute to the SAS/IML File Exchange."

Summary

Creating a package is a convenient way to share your work with others in your company or around the world. It enables other SAS/IML programmers to leverage your expertise and programming skills. And who knows, it just might make you famous!

For complete details about how to create a package, see the "Packages" chapter of the SAS/IML User's Guide.

Post a Comment

How much do New Yorkers tip taxi drivers?

New York taxis: tip ersus fare amount

When I read Robert Allison's article about the cost of a taxi ride in New York City, I was struck by the scatter plot (shown at right; click to enlarge) that plots the tip amount against the total bill for 12 million taxi rides. The graph clearly reveals diagonal and horizontal trends in the data. The apparent "lines" in the scatter plot correspond to three kinds of riders:

  • The horizontal lines correspond to riders who tip a set amount regardless of the fare amount. There are strong visible trends for $5, $10, $15, and $20. There are weaker trends at other whole-dollar amounts.
  • The diagonal lines that pass through the origin are riders whose tip is a percentage of the fare. There are strong visible trends at 15%, 20%, 25%, and even 30%.
  • The diagonal lines that slope down from the upper left are riders who give the driver a set amount and tell him to keep the change. There are clear lines in the full-sized graph for riders whose total bill (fare plus tip) was $20, $40, $50, $60, $70, and $100.

The graph made me wonder how much a typical New Yorker tips as a percentage of the bill.

Tips and percentages

You can download the data from Robert Allison's post. This article analyzes the same 12 million records (from January, 2015) that Robert featured.

The first step is simply to use a SAS DATA step to compute the percentage of the tip. The following program uses a data VIEW to save disk space. You can then use PROC MEANS in Base SAS to compute percentiles for the tip amounts:

libname taxi "<path to data>";
 
data tax / view=tax;
   label tip_pct = "Tip Percent";
   set taxi.ny_taxi_data_2015_01;
   where 0 < fare_amount <= 100 & 0 <= tip_amount <= 100;
   total_bill = total_amount - tip_amount ;
   tip_pct = tip_amount  / total_bill;
   keep fare_amount tip_amount tip_pct total_bill;
run;
 
title "Tip as Percentage of Total Fare";
proc means data=tax P40 P50 P75 P90 P95 P99;
   var tip_pct;
run;
taxitips1

The output shows some surprising facts. According to these data on NYC Yellow Taxis:

  • More than 40% of all riders do not tip.
  • The median tip is 13.3% of the fare.
  • Twenty-five percent of all riders tip 20% or more.
  • Five percent of all riders tip 25% or more.

Edit: (03MAY2016) A reader notes in the comments that 40% of riders not tipping seems excessively high. He suggests that the data could be biased by a systematic underreporting of tips and suggests we need a better understanding of data collection process. I went to the "data dictionary" for this data and discovered that the tip_amount value "is automatically populated for credit card tips. Cash tips are not included.” Ah-hah! The reader is correct! There is a systematic bias in the data.

The distribution of tip percentages

You can use a histogram to show the distribution of tip percentages. The distribution has a long tail, so the following call to PROC UNIVARIATE truncates the distribution at 100%.

proc univariate data=tax(obs=1000000);
   format tip_pct PERCENT7.0;
   where tip_pct <= 1;
   var tip_pct;
   histogram tip_pct / endpoints=(0 to 1 by 0.05) statref=Q1 Median Q3
             odstitle="Percent of Fare Left as Tip for NYC Yellow Taxis";
run;
Distribution of Taxi Tips as Percentage of Fare

The histogram appears to be a mixture distribution. The first bar (about 41%) represents the riders who leave either no tip or a negligible tip. [Edit: Many of these are cash transactions for which the tip is not reported.] Then there is a "hump," which appears to be normally distributed, that represents riders who leave a tip that is between 5% and 40% of the fare. As you might expect, the mode of the hump is between 15% and 20%, which is the cultural norm in the US. The last part of the distribution is the long tail, which represents generous individuals whose tip is a substantial percentage of the fare.

The vertical lines show the 25th percentile (at 0%), the 50th percentile (at 13%), and the 75th percentile (at 20%).

Examining the central part of the distribution

The distribution of tips is dominated by the large proportion of taxi riders who do not tip. (Or, to be more cynical, the proportion of rides for which the reported tip is zero.) This section excludes the non-tippers and analyzes only those riders who leave a non-zero tip.

Let's repeat the analyses of the previous sections, but this time exclude the non-tippers. First, look at the quantiles. Notice the WHERE clause to exclude the non-tippers.

title "Tip Percentages Conditional on Tipping";
proc means data=tax Q1 Median Q3 P90 P95 P99;
   where tip_amount > 0;   /* exclude non-tippers */
   var tip_pct;
run;
taxitips3

Ah! That's more like what I expected to see! For riders who leave a tip, the median and 75th percentiles are about 20%. About one in ten New Yorkers that leave a tip leave 25% or more.

Visualizing the distribution of tipping percentages among those who tip

For a final graph, let's visualize the distribution of non-zero tip amounts. I could draw the same histogram as before, but for fun I will use the three-panel visualization in SAS that I created a few years ago. In the three-panel visualization, a histogram is stacked in a column with a box plot and a normal quantile-quantile plot (Q-Q plot). Each panel reveals slightly different aspects of the distribution. You can download the %ThreePanel macro from my previous article. The following graph shows the three-panel visualization (click to enlarge) of one million tippers:

%ThreePanel(tax(obs=1000000 where=(tip_amount > 0 && tip_pct <= 1)), tip_pct);
Three-panel distribution of taxi tips

As compared to the previous histogram, this histogram uses a smaller bin width and therefore shows small-scale structure that was not previously apparent. The bumps in the histogram and kernel density estimate show that many riders tip 20%, 25%, and 30%. The box plot shows a relatively narrow interquartile range (the box) and a many riders whose tip percentages are well-above or well-below the median (the outliers).

Lastly, consider the Q-Q plot. If the points fall near the diagonal line, then the distribution is approximately normal. That is not the case here. The middle of the distribution falls somewhat near the diagonal line, but the tails do not. By the usual interpretation of a Q-Q plot, the lower- and upper tails of the distribution are decidedly nonnormal. The middle of the distribution is approximately normal, but the staircase structure of the Q-Q plot means that certain values are repeated many times (20%, 25%, and 30%).

I conjecture that the lower tail is from the "keep the change" riders who rounded the fare to a whole-dollar amount that was few pennies more than the fare. The upper tail is likely from generous riders who substantially rounded up the fare, perhaps paying $20 bill to cover a $13 fare.

In summary, a short analysis of 12 million New York taxi rides indicates that slightly more than 40% of riders do not leave any tip. [EDIT: This large percentage appears to be from cash transactions for which the tip amount is not collected.] A small percentage leave a tiny tip, but the bulk of the tippers leave tips in the 16% to 20% range. About 10% of riders leave as much as 25% or 30% for a tip, and a small number of riders leave tips that correspond to larger percentages.

Post a Comment

Packages: A new way to share SAS/IML programs

My previous post highlighted presentations at SAS Global Forum 2016 that heavily used SAS/IML software. Several of the authors clearly want to share their work with the wider SAS analytical community. They include their SAS/IML program in an appendix or mention a web site or email address from which the reader can obtain the SAS/IML functions that implement the analysis.

As of SAS/IML 14.1, there is an easy way to share SAS/IML functions with others. The IML language now supports packages through the PACKAGE statement. A package is a collection of files that contains source code, documentation, example programs, and sample data. Packages are distributed as ZIP files.

This article shows how to install and use a package. A subsequent article will discuss how to author a package. For more information, see my 2016 SAS Global Forum paper, "Writing Packages: A New Way to Distribute and Use SAS/IML Programs".

The main steps to use a package are as follows:

  1. Download the ZIP file that contains a package.
  2. To install the package, use the PACKAGE INSTALL statement from within PROC IML or SAS/IML Studio.
  3. To learn how to use the package, read the package documentation and run some of the sample programs that the package provides.
  4. To use the functions in the package, use the PACKAGE LOAD statement to load the functions into your current SAS/IML session.
  5. To access sample data that the package provides, use the PACKAGE LIBNAME statement.

In the following sections, each step is illustrated by using the polygon package, which contains functions that compute geometric properties of planar polygons. You can download the package from the SAS/IML File Exchange.

Obtain the package

The SAS/IML File Exchange is the recomended location to obtain packages. However, some authors might choose to post packages on GitHub or on their own web site.

Download the polygon package and save it to a location that SAS can access. This article assumes that the ZIP file is saved to the location C:\Packages\polygon.zip on a Windows PC.

Install the package

To install the package, run the following statements in SAS/IML 14.1 or later:

proc iml;
package install "C:\Packages\polygon.zip"; 
quit;

You only need to install a package one time. You can then call it in future PROC IML sessions. The package remains installed even after you exit from SAS and reboot your computer.

Read the documentation

The help directory in the ZIP file contains the full documentation for the package, often in the form of a PDF file. You can read that documentation by using a PDF reader outside of SAS. The PDF documentation might contain diagrams, equations, graphs, and other non-text information.

Authors should also provide a brief text-only version of the syntax. While running SAS, you can display the text-only version in the SAS log, as follows:

proc iml;
package help polygon;

For the polygon package, the SAS log contains the following overview of the package and the syntax for every public module.

Polygon Package
Description: Computational geometry for polygons
 
A polygon is represented as an n x 2 matrix, where the rows represent
adjacent vertices for the polygon. The polygon should be "open,"
which means that the last row does not equal the first row.
<...More text...>
 
Module Syntax:
PolyArea(P);
   returns the areas of simple polygons.
 
<...documentation for nine other functions...>
 
PolyWindingNumber(P, R);
   returns the winding number of the polygon R around the point R.

The purpose of the PACKAGE HELP syntax is to remind the user of the syntax of the functions in the package while inside of SAS.

Another way to learn to use a pakage is to examine and run sample programs that the package author provides. For the polygon package, you can look in the programs directory. The file Example.sas demonstrates calling functions in the package.

Load the functions

To read the functions in the package into the current IML session, use the PACKAGE LOAD statement, as follows:

package load polygon;

The SAS log will display a NOTE for each modules that is loaded:

NOTE: Module _POLYAREA defined.
NOTE: Module POLYAREA defined.
<...More text...>
NOTE: Module POLYSTACK defined.

Notice that all function begin with a common prefix, in this case "POLY" (or "_POLY," for internal-only functions). This is a good programming practice because it reduces the likelihood that functions in one package conflict with functions in another package. For example, if you load two packages that each have a function named "COMPUTE" (or an equally generic name), then there will be a comflict. By using "POLY" as a prefix, there is less likely to be a name collision.

Call the functions

After the functions are loaded, you can call them from your program. For example, the following statements define the four vertices of a rectangle and then call functions from the package to compute the perimeter, area, and whether the polygon is a convex region.

P = {0 0, 1 0, 1 2, 0 2};         /* vertices of rectangle */
Perimeter = PolyPerimeter(P);
Area      = PolyArea(P);
IsConvex  = PolyIsConvex(P);
print Perimeter Area IsConvex;
package1

Use sample data

Some packages include sample data sets. You can use the PACKAGE LIBNAME statement to create a SAS libref that points to the data subdirectory of an installed package. The following statements read in sample data for polygons that are defined in the Simple data set. After the vertices of the polygons are read, the PolyDraw subroutine is called to visualize the polygons in the data set.

package libname PolyData polygon;    /* define libref */
use PolyData.Simple;                 /* sample data in package */
read all var {u v ID} into P;        /* vertices; 3rd col is ID variable */
close PolyData.Simple;
run PolyDraw(P);                     /* visualize the polygons */
package2

Display package information

You can use the PACKAGE LIST statement to print the name and version of all installed packages. You can use the PACKAGE INFO command to obtain details about a specific package. The output for these statements is not shown.

package list;                /* list all installed packages */ 
package info polygon;        /* details about polygon package */

Summary

This article briefly describes how to install and use a SAS/IML package. Packages were introduced in SA/IML 14.1, which was released with SAS 9.4m3 in July 2015. This article shows how to download, install, load, and use packages that are written by others.

For additional details about packages, see Wicklin (2016). The official documentation of packages is contained in the "Packages" chapter in the SAS/IML User's Guide.

Post a Comment

Matrix computations at SAS Global Forum 2016

Last week I attended SAS Global Forum 2016 in Las Vegas. I and more than 5,000 other attendees discussed and shared tips about data analysis and statistics.

Naturally, I attended many presentations that featured using SAS/IML software to implement advanced analytical algorithms. Several speakers showed impressive mastery of SAS/IML programming techniques. In all, almost 30 papers stated that they used SAS/IML for some or all of the computations.

The following presentations were especially impressive. As you can see, topics include Bayesian analysis, signal processing, spatial analysis, simulation, and logistic regression.

  • Jason Bentley, a PhD student at The University of Sydney, presented a paper about Bayesian inference for complex hierarchical models with smoothing splines (joint work with Cathy Lee). Whereas Markov-chain Monte Carlo techniques are "computationally intensive, or even infeasible" for large hierarchical models, Bentley coded a "fast deterministic alternative to MCMC." His SAS/IML code runs in 22 seconds on a hierarchical model that includes 350 schools and approximately 88,000 students. Jason says that this was his first serious SAS/IML program, but that he found the language "ideal" for implementing the algorithm.
  • Woranat Wongdhamma at Oklahoma State University used wavelet analysis in SAS/IML to construct a predictive model that helps clinical researchers diagnose obstructive sleep apnea. I did not realize that 1 in 15 adult Americans have moderate or severe sleep apnea, and many of these people are undiagnosed. Hopefully Wongdhamma's analysis will help expedite the diagnosis process.
  • dasilva Alan Ricardo da Silva, a professor at the University of Brasilia, has written many papers that feature SAS/IML programs. This year (with T. Rodrigues), Alan showed how to implement an algorithm for geographically weighted negative binomial regression. He applied this spatial analysis method to freight transportation in Brazil. The image at right is taken from his paper.
  • A second paper by da Silva (with co-author Paulo da Silva) shows how to use SAS/IML to simulate data from a skew normal and a skew t distribution.
  • Gongwei Chen used SAS/IML to build a discrete-time Markov chain model to forecast workloads at the Washington State Department of Corrections. Social workers and parole officers supervise convicted offenders who have been released from prison or were sentenced to supervision by the court system. Individuals who exhibit good behavior can transition from a highly supervised situation into less supervision. Other individuals might commit a new offense that requires an increase in supervision. Chen's probabilistic model helps the State of Washington forecast the resources needed for supervising offenders.
  • Katherine Cai and Jeffrey Wilson at Arizona State University used SAS/IML to build a logistic model of longitudinal data with time-dependent covariates. The algorithm was applied to obesity data and to children’s health data in the Philippines.
  • Another paper by Wilson (co-authored with Kyle Irimata) used SAS/IML to implement exact logistic regression for nested binary data.

The SAS/IML language was used in many other e-posters and presentations that unfortunately I could not attend due to other commitments. All of the papers for the proceedings are available online, so you can search the proceedings for "IML" or any other topic interests you.

Experienced SAS programmers recognize that when an analysis is not available in any SAS procedure, the SAS/IML language is often the best choice for implementing the analysis. These papers and my discussions with SAS customers indicate that the SAS/IML language is actively used for advanced analytical models that require simulation, optimization, and matrix computations. Because SAS/IML is part of the free SAS University Edition, which has been downloaded 500,000 times, I expect the number of SAS/IML users to continue to grow.

Post a Comment

Visualize missing data in SAS

You can visualize missing data. It sounds like an oxymoron, but it is true.

How can you draw graphs of something that is missing? In a previous article, I showed how you can use PROC MI in SAS/STAT software to create a table that shows patterns of missing data in SAS. In addition to tables, you can use graphics to visualize patterns of missing data.

Counts of observations that contain missing values

As shown in the previous post, it is useful to be able to count missing values across rows as well as down columns. These row and column operations are easily accomplished in a matrix language such as SAS/IML. For example, the following SAS/IML statement read in data from the Sashelp.Heart data set. A single call to the COUNTMISS function counts the number of missing values in each row. The BAR subroutine creates a bar chart of the results:

proc iml;
varNames = {AgeAtStart Height Weight Diastolic 
            Systolic MRW Smoking Cholesterol};
use Sashelp.Heart;                         /* open data set */
read all var varNames into X;              /* create numeric data matrix, X */
close Sashelp.Heart;
 
title "Counts of Rows That Contain 0, 1, 2, or 3 Missing Values";
title2 "Sashelp.Heart Data";
Count = countmiss(X,"row");            /* count missing values for each obs */
call Bar(Count);                       /* create bar chart */
Visualize Missing Data in SAS: Count of observations that contain missing values

The bar chart clearly shows that most observations do not contain any missing values among the specified variables. A small percentage of observations contain one missing value. Even fewer contain two or three missing values.

Which observations contain missing values?

It can be useful to visualize the locations of observations that contain missing values. Are missing values spread uniformly at random throughout the data? Or do missing values appear in clumps, which might indicate a systematic issue with the data collection?

The previous section computed the Count variable, which is a vector that has the same number of elements as there are rows in the data. To visualize the locations (rows) of missing values, use the LOC function to find the rows that have at least one missing value and plot the row indices:

missRows = loc(Count > 0);                  /* which rows are missing? */
title "Location of Rows That Contain Missing Values";
call histogram( missRows ) scale="count"    /* plot distribution */
     rebin={125, 250}                       /* bin width = 250 */
     label="Row Number";                    /* label for X axis */
Visualize Missing Data in SAS: Location of rows that contain missing values

This histogram is very revealing. The bin width is 250, which means that each bar includes 250 observations. For the first 2000 observations, about 15 of every 250 observations contain a missing value. Then there is a series of about 500 observations that do not contain any missing observations. For the remaining observations, only about three of every 250 observations contain missing values. It appears that the prevalence of missing values changed after the first 2000 observations. Perhaps the patients in this data set were entered according to the date in which they entered the program. Perhaps after the first 2000 patients were recruited, there was a change in data collection that resulted in many fewer missing values.

A heat map of the missing values

The ultimate visualization of missing data is to use a heat map to plot the entire data matrix. You can use one color (such as white) to represent nonmissing elements and another color (such as black) to represent missing values.

For many data sets (such as Sashelp.Heart), missing observations represent a small percentage of all rows. For data like that, the heat map is primarily white. Therefore, to save memory and computing time, it makes sense to visualize only the rows that have missing values. In the Sashelp.Heart data (which has 5209 rows), only 170 rows have missing values. For each of the 170 rows, you can plot which variables are missing.

The following statements implement this visualization of missing data in SAS. The matrix Y contains only the rows of the data matrix for which there is at least one missing value. You can call the CMISS function in Base SAS to obtain a binary matrix with values 0/1. The HEATMAPDISC subroutine in SAS/IML enables you to create a heat map of a matrix that has discrete values.

ods graphics / width=400px height=600px;
Y = X[missRows,];              /* extract missing rows   */
call HeatmapDisc( cmiss(Y) )   /* CMISS returns 0/1 matrix */
     displayoutlines=0 
     colorramp={white black}
     xvalues=VarNames          /* variable names along bottom */
     yvalues=missRows          /* use nonmissing rows numbers as labels */
     showlegend=0 title="Missing Value Pattern";
Visualize Missing Data in SAS: Heat Map that Shows Missing Value Pattern

The black line segments represent missing values. This heat map summarizes almost all of the information about missing values in the data. You can see which variables have no missing values, which have a few, and which have many. You can see that the MRW variable is always missing when the Weight variable is missing. You can see that the first 2250 observations have relatively many missing values, that there is a set of observations that are complete, and that the missing values occur less frequently for the last 2000 rows.

If you do not have SAS/IML software, you can still create a discrete heat map by using the Graph Template Language (GTL).

Be aware that this heat map visualization is limited to data for which the number of rows that have missing values is somewhat small. (About 1000 rows or less.) In general, a heat map should contain at least one vertical pixel for each data row that you want to visualize. For the Sashelp.Heart data, there are 170 rows that have missing data, and the heat map in this example has 600 vertical pixels. If you have thousands of rows and try to construct a heat map that has only hundreds of pixels, then multiple data rows must be mapped onto a single row of pixels. The result is not defined, but certainly not optimal.


Do you ever visualize missing data in SAS? What techniques do you use? Has visualization revealed something about your data that you hadn't known before? Leave a comment.

Post a Comment

Examine patterns of missing data in SAS

Missing data can be informative. Sometimes missing values in one variable are related to missing values in another variable. Other times missing values in one variable are independent of missing values in other variables. As part of the exploratory phase of data analysis, you should investigate whether there are patterns in the missing data.

Counting is the simplest analysis of any data (missing or otherwise). For each variable, how many observations are missing? You can then proceed to more complex analyses. Do two variable share a similar missing value pattern? Can you predict that one variable will be missing if you know that another variable is missing?

This article shows a simple way to examine patterns of missing values in SAS. The example data is the Sashelp.Heart data set, which contains information about patients in the Framingham Heart Study. The meaning of most variables is evident from the variable's name. An exception is the MRW variable, which contains a patient's "Metropolitan Relative Weight." The MRW is a percentage of the patient's weight to an ideal weight. Thus an MRW score of 100 means "ideal weight" whereas a score of 150 means "50% heavier than ideal." The MRW is similar to the more familiar body-mass index (BMI).

The number of missing values for each variable

I have previously shown how to use PROC FREQ to count the number of missing values for numeric and character variables. The technique uses a custom format to classify each value as "Missing" or "Not Missing."

Alternatively, for numeric variables you can use PROC MEANS to count the number of missing values. PROC MEANS creates a compact easy-to-read table that summarizes the number of missing values for each numerical variable. The following statements use the N and NMISS options in the PROC MEANS statement to count the number of missing values in eight numerical variables in the Sashelp.Heart data set:

/* count missing values for numeric variables */
proc means data=Sashelp.Heart nolabel N NMISS;
var AgeAtStart Diastolic Systolic Height Weight MRW 
    Smoking Cholesterol;
run;
missingpatterns1

The NMISS column in the table shows the number of missing values for each variable. There are 5209 observations in the data set. Three variables have zero missing values, and another three have six missing values. The Smoking variable has 36 missing values whereas the Cholesterol variable has 152 missing values.

These univariate counts are helpful, but they do not tell you whether missing values for different variables are related. For example, there are six missing values for the Height, Weight, and MRW variables. How many patients contributed to those six missing value? Ten? Twelve? Perhaps there are only six patients, each with missing values for all three variables? If a patient has a missing Height, does that imply that Weight or MRW is missing also? To answer these questions you must look at the pattern of missing values.

Patterns of missing values

The MI procedure in SAS/STAT software is used for multiple imputation of missing values. PROC MI has an option to produce a table that summarizes the patterns of missing values among the observations. The following call to PROC MI uses the NIMPUTE=0 option to create the "Missing Data Patterns" table for the specified variables:

ods select MissPattern;
proc mi data=Sashelp.Heart nimpute=0;
var AgeAtStart Height Weight Diastolic 
    Systolic MRW Smoking Cholesterol;
run;
Missing Data Patterns table from PROC MI

The table reports the number of observations that have a common pattern of missing values. In addition to counts, the table reports mean values for each group. Because the table is very wide, I have truncated some of the mean values.

The first row counts the number of complete observation. There are 5039 observations that have no missing values.

Subsequent rows report the number of missing values for variables, pairs of variables, triplets of variables, and so forth. The variables are analyzed from right to left. Thus the second row shows that 124 observations have missing values for only the rightmost variable, which is Cholesterol. The third row shows that eight observations have missing values for only the Smoking variable. The fourth row shows that 28 observations have missing values for both Smoking and Cholesterol.

The table continues by analyzing the remaining variables that have missing values, which are Height, Weight, and MRW. You can see that MRW is never missing by itself, but that it is always missing when Weight is missing. Height is missing by itself in four cases, and is missing simultaneously with Weight in two cases.

Notice that each row of the table represents a disjoint set of observations. Consequently, we can easily answer the previous questions about how many patients contribute to the missing values in Height and Weight. There are 10 patients: four are missing only the Height measurement, four are missing only the Weight measurement (which forces MRW to be missing), and two are missing both Height and Weight.

This preliminary analysis has provided important information about the distribution of missing data in the Sashelp.Heart data set. Most patients (96.7%) have complete data. The most likely measurement to be missing is Cholesterol, followed by information about whether the patient smokes. You can see exactly how many patients are missing Height measurements, Weight measurements, or both. It is obvious that the MRW variable has a missing value if and only if the Weight variable is missing.

The "Missing Data Patterns" table from PROC MI provides a useful summary of missing values for each combination of variables. Examining patterns of missing values can lead to insight into the data collection process, and is also the first step prior to modeling missing data by using multiple imputation. In my next post, I will show how to use basic graphics to visualize patterns of missing data.

Post a Comment

Head-tail versus head-head: A counterintuitive property of coin tosses

I saw an interesting mathematical result in Wired magazine. The original article was about mathematical research into prime numbers, but the article included the following tantalizing fact:

If Alice tosses a [fair] coin until she sees a head followed by a tail, and Bob tosses a coin until he sees two heads in a row, then on average, Alice will require four tosses while Bob will require six tosses (try this at home!), even though head-tail and head-head have an equal chance of appearing after two coin tosses.

Yes, I would classify that result as counter-intuitive! When I read this paragraph, I knew I had to simulate these coin tosses in SAS.

Why does head-tail happen sooner (on average)?

Define the "game" for Alice as a sequence of coin tosses that terminates in a head followed by a tail. For Bob, the game is a sequence of coin tosses that terminates in two consecutive heads.

You can see evidence that the result is true by looking at the possible sequence of tosses that end the game for Alice versus Bob. First consider the sequences for Alice that terminate with head-tail (HT). The following six sequences terminate in four or fewer tosses:

  HT
 THT
 HHT
THHT
TTHT
HHHT

In contrast, among Bob's sequences that terminate with head-head (HH), only four require four or fewer tosses:

  HH
 THH
TTHH
HTHH

For Bob to "win" (that is, observe the sequence HH), he must first toss a head. On the next toss, if he obtains a head again, he wins. But if he observes a tail, he is must start over: He has to toss until a head appears, and the process continues.

In contrast, for Alice to "win" (that is, observe the sequence HT), she must first toss a head. If the next toss shows a tail, she wins. But if is a head, she doesn't start over. In fact, she is already halfway to winning on the subsequent toss.

This result is counterintuitive because a sequence is just as likely to terminate with HH as with HT. So Alice and Bob "win" equally often if they are observing flips of a single fair coin. However, the sequences for which Alice wins tend to be shorter than those for which Bob wins.

A simulation of coin tosses

The following SAS DATA step simulates Alice and Bob tossing coins until they have each won 100,000 games. For each player, the program counts how many tosses are needed before the "winning" sequence occurs. A random Bernoulli number (0 or 1 with probability 0.5) simulates the coin toss.

data CoinFlip(keep= player toss);
call streaminit(123456789);
label Player = "Stopping Condition";
do i = 1 to 100000;
   do player = "HT", "HH";               /* first Alice tosses; then Bob */
      first = rand("Bernoulli", 0.5);    /* 0 is Tail; 1 is Head */
      toss = 1;  done = 0;
      do until (done);
         second = rand("Bernoulli", 0.5);
         toss = toss + 1;                /* how many tosses until done? */
         if player = "HT" then           /* if Alice, ...        */
            done = (first=1 & second=0); /* stop when HT appears */
         else                            /* if Bob, ...          */
            done = (first=1 & second=1); /* stop when HH appears */
         first = second;                 /* remember for next toss */
      end;
      output;
   end;
end;
run;
 
proc means data=CoinFlip MAXDEC=1 mean median Q3 P90 max;
   class player;
   var toss;
run;
Sample statistics for number of tosses required to end a simulated coin game

The MEANS procedure displays statistics about the length of each game. The mean (expected) length of the game for Bob (HH) is 6 tosses. 75% of games were resolved in eight tosses or less and 90% required 12 or less. In the 100,000 simulated games, the longest game that Bob played required 61 tosses before it terminated. (The maximum game length varies with the random number seed.)

In contrast, Alice resolved her 100,000 games by using about 50% fewer tosses. The mean (expected) length of the game for Alice (HT) is 4 tosses. 75% of the games required five or fewer tosses and 90% were completed in seven or fewer tosses. In the 100,000 simulated games, the longest game that Alice played required 22 tosses.

Distribution of tosses for two games with different winning criteria

You can create a comparative histogram to compare the distribution of the game lengths. The distribution of the length of Bob's games is shown in the upper histogram. It has a long tail.

In contrast, the distribution of Alice's game lengths is shown in the bottom of the panel. The distribution drops off much more quickly than for Bob's games.

It is also enlightening to see the empirical cumulative distribution functions (ECDFs) for the number of tosses for each player. You can download a SAS program that creates the comparative histogram and the ECDFs. For skeptics, the program also contains a simulation that demonstrates that a sequence terminates with HH as often as it terminates with HT.

The result is not deep, but it reminds us that the human intuition gets confused by conditional probability. Like the classic the Monty Hall problem, simulation can convince us that a result is true, even when our intuition refuses to believe it.

Post a Comment