The use of SAS/IML software in published research

SAS/IML software is used by many SAS programmers, primarily for creating custom algorithms and macros that implement statistical analyses that are not built into any SAS procedure. I know that PROC IML is used regularly by pharmaceutical companies, by the financial and insurance industries, and by researchers in medical colleges and business schools, among others.

However, when I was recently asked how many research papers are published each year that use SAS/IML, I had no idea. I conjectured "a few dozen," figuring that most SAS/IML programmers work in a corporate setting or in government, and that these researchers are less likely than academics to publish their results in journals.

Over the weekend I used Google Scholar to try to get a better estimate. I constructed the following query for Google Scholar:

   SAS +IML OR "proc iml" -author:wicklin

The query omits any papers written by me or that appear on the domain. The idea was to exclude any white papers, conference proceedings, or marketing material that is created or hosted at SAS. The results also exclude articles posted to the SAS/IML File Exchange.

The results surprised me: I was wrong by an order of magnitude! I clicked on the links for a few dozen publications to ascertain how many of the hits were false positives. For example, an article that says "we decided not to use SAS/IML in this study," would appear on the list, even though the authors did not actually use the programming language. There were a few false positives and there were a few SAS manuals in the list. However, the vast majority of the Google list consisted of scholarly journal articles, books, or conference proceedings that used SAS/IML in a nontrivial way.

I encourage you to submit the query yourself and to look at the variety of applications and the wide range of journals. More than 2,500 entries were published prior to 1995. In addition to those papers, the following SAS DATA step gives the number of Google Scholar entries for the SAS/IML query for the past 20 years:

/* Results from Google Scholar. Downloaded 6/28/2015 
   "sas" +IML OR "proc iml" -author:wicklin
   There were 2510 results when year <= 1994 */
data IMLPub;
input Year Publications @@;
datalines;
1995 231  1996 255  1997 311  1998 301
1999 299  2000 355  2001 361  2002 424
2003 465  2004 493  2005 543  2006 568
2007 526  2008 611  2009 633  2010 580
2011 589  2012 512  2013 603  2014 552
;
run;
title "Number of Publications that Mention SAS/IML Software";
title2 "Data from Google Scholar";
proc sgplot data=IMLPub;
   series x=Year y=Publications / markers;
   yaxis min=0 grid values=(0 to 600 by 100) valueshint;
   xaxis grid;
run;

The number of publications appears to increase until about 2005 and has been approximately constant since then. In the past 10 years, Google Scholar reports about 550 publications per year. I had no idea that the number was that high.

I do not advocate using internet search engines to rank software based on the number of web sites that mention the software. I have argued that using the number of search results as a proxy for popularity is of dubious value and is fraught with statistical perils.

However, I found it intellectually interesting to read the titles and excerpts of the scholarly publications that mention SAS/IML software. Browse the list yourself, or see the list that includes papers on the domain for a more complete perspective.

If you are thinking about using SAS/IML software in your next research project, you might want to search Google Scholar first. Someone else might have already written a scholarly paper that solves your problem!


Merge observed outcomes into a list of all outcomes

When you count the outcomes of an experiment, you do not always observe all of the possible outcomes. For example, if you roll a six-sided die 10 times, it might be that the "1" face does not appear in those 10 rolls. Obviously, this situation occurs more frequently with small data than with large, although even for a large data set you might not observe rare events.

When conducting a statistical analysis, it is sometimes necessary to specify the unobserved categories. I have previously shown how to use the DATA step to add outcomes with zero counts and run PROC FREQ on the resulting data. I have also shown how to use the DATA step to merge observed data with expected values so that you can overlay a parametric density on a histogram.

In the SAS/IML language, you can use the LOC-ELEMENT technique to merge counts for observed outcomes into a vector of zeros. The result is the counts for all outcomes, where 0 indicates that an outcome is unobserved.

The following SAS/IML statements analyze 10 rolls of a six-sided die. The TABULATE subroutine counts the number of times each face appeared. The MergeCounts module inserts the observed counts into a vector.

proc iml;
/* Use LOC-ELEMENT technique to merge observed outcomes into all outcomes */
start MergeCounts(allVals, obsVals, counts);
   allCounts =  j(nrow(allVals),ncol(allVals),0);  /* array of 0s   */
   idx = loc(element(allVals, obsVals)); /* LOC-ELEMENT technique   */
   allCounts[idx] = counts;                        /* insert counts */
   return( allCounts );
finish;
rolls = {4 6 4 6 5 2 4 3 3 6};           /* only 5 faces observed */
call tabulate(Vals, Counts, rolls);
allVals = 1:6;                           /* all outcomes */
allCounts = MergeCounts(allVals, Vals, Counts);
print Counts[c=(char(Vals))] allCounts[c=(char(allVals))];

Whereas the first table contains only the five observed faces, the second table lists all six faces of the die, with 0 indicating the "1" face was not observed.

In addition to the applications mentioned earlier, this technique is useful when you are computing goodness-of-fit statistics. For example, to compute the chi-square statistic for the null hypothesis that the die is fair (each face appears equally often), you need to form the difference between the observed and expected values for each face. The following statements implement a chi-square goodness-of-fit test:

Expected = sum(allCounts) / 6;
ChiSq = sum((allCounts-Expected)##2 / Expected); /* sum (O-E)^2/E over categories */
pval = 1 - cdf("chisquare", ChiSq, ncol(allVals)-1);  /* p-value for test */
print ChiSq pval;

Most statistical textbooks recommend that you not use the chi-square test if the expected counts are less than 5, as they are for this tiny data set. Nevertheless, this example illustrates the usefulness of adding unobserved outcome categories to data that contain the observed categories.


An easy way to use numbers for column headers

When I am computing with SAS/IML matrices and vectors, I often want to label the columns or rows so that I can better understand the data. The labels are called headers, and the COLNAME= and ROWNAME= options in the SAS/IML PRINT statement enable you to add headers for columns and rows, respectively. However, sometimes I want to use numerical values as headers, which is a problem because the COLNAME= and ROWNAME= options expect character strings.

The solution, of course, is to use a SAS format to convert numbers to strings. The SAS/IML language supports the CHAR function to convert a number into a string. By default, the CHAR function applies the BEST. format with a default field width (usually 12). You can specify a field width by using an optional argument. You can also specify the number of decimal places in the number, which is equivalent to specifying a w.d format.

For example, suppose you have data that indicates the results of 10 rolls of a single six-sided die. You want to print the values of the faces along with a count of the number of times that each face appeared. You can use the TABULATE function to count the frequency of each face, as follows:

proc iml;
rolls = {4 6 4 6 5 2 4 3 3 6};
call tabulate(Vals, Counts, rolls);
print Vals, Counts;

The problem with this output is that the labels (the faces) are not attached to the counts. You can use the COLNAME= option on the print statement to label columns. The following statement uses the CHAR function to convert the numerical faces to strings:

print Counts[colname=(char(Vals))];  /* use face value to label counts */

Although the CHAR function is sufficient for most data, you might sometimes need to apply a special SAS format. For example, your labels might be dates or times. You can use the PUTN function to apply an arbitrary format to a number. For example, the following data specifies US highway fatalities during the 2014 Independence Day weekend and uses the DATE7. format to convert dates (which are stored as numbers) into strings:

days = "03JUL2014"d:"06JUL2014"d;  /* stored as 19907:19910 */
fatalities = {118 212 87 123};
print fatalities[colname=(putn(days,"Date7."))];

You can use any string for column headers, including strings that include spaces.

You can use the same technique for adding row headers. For example, if you want to enumerate the rows of a matrix that has N rows, you can specify rowname=(char(1:N)) as an option to the PRINT statement.
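For example, the following sketch numbers the rows of a small matrix; the matrix A is made up for illustration:

```sas
proc iml;
A = {1 2, 3 4, 5 6};            /* hypothetical 3 x 2 matrix */
N = nrow(A);
print A[rowname=(char(1:N))];   /* rows are labeled 1, 2, 3 */
```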


The sensitivity of Newton's method to an initial guess

(Image from Wikipedia: domains of attraction for Newton's method applied to a cubic polynomial.)

In my article about finding an initial guess for root-finding algorithms, I stated that Newton's root-finding method "might not converge or might converge to a root that is far away from the root that you wanted to find." A reader wanted more information about that statement.

I have previously shown how to implement Newton's method in SAS. The behavior of Newton's method depends on the initial guess. If you provide a guess that is sufficiently close to a simple root, Newton's method will converge quadratically to the nearby root. However, if your guess is near a critical point of the function, Newton's method will produce a "next guess" that is far away from the initial guess. Further iterations might converge to an arbitrary root, might endlessly cycle in a periodic or aperiodic manner, or might diverge to infinity.

The dynamics of Newton iteration can be quite complex. If you owned a PC in the 80's and early 90's, you might have spent countless hours computing Mandelbrot sets and Julia sets. You might have seen pictures like the one at the beginning of this article, which show the domains of attraction for Newton's iteration for a cubic polynomial. In the picture, each point in the complex plane is colored according to which root Newton's method converges to when it begins at that point. The points that eventually converge to a root are the Fatou set, whereas the points that do not converge form the Julia set.

The sensitivity of Newton's method

You can perform the same kind of computer experiments for Newton's method applied to a real function. (You can download the SAS/IML programs that I used to create the graphs in this article.) Consider the polynomial f(x) = x^5 - 2x^4 - 10x^3 + 20x^2 + 9x - 18. The roots of the polynomial are {-3, -1, 1, 2, 3}. The polynomial has critical points (where the derivative vanishes) near -2.3, -0.2, 1.5, and 2.6. Recall that Newton's method involves iteration of the rational function N(x) = x - f(x)/f'(x), which has singularities at the critical points of f.

You can ask the following question: For each point, x, to which root does Newton's method converge when x is the initial guess? You can also keep track of how many iterations it takes for Newton's method to converge to a root.
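The downloadable programs contain the full experiment, but the core computation can be sketched as follows. The Func and Deriv modules, the iteration cap, and the convergence tolerance are my own choices for this polynomial:

```sas
proc iml;
start Func(x);
   return( x##5 - 2*x##4 - 10*x##3 + 20*x##2 + 9*x - 18 );
finish;
start Deriv(x);                  /* derivative of Func */
   return( 5*x##4 - 8*x##3 - 30*x##2 + 40*x + 9 );
finish;

roots = {-3 -1 1 2 3};           /* known roots of the polynomial */
x = 0.1;                         /* an initial guess */
do iter = 1 to 50 until( min(abs(x - roots)) < 1e-8 );
   x = x - Func(x)/Deriv(x);     /* iterate N(x) = x - f(x)/f'(x) */
end;
print x iter;                    /* root found and iterations required */
```

Running this sketch over a fine grid of initial guesses, and recording the root and iteration count for each guess, produces the kind of data that is summarized in the needle plots.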


If you apply Newton's method to 250 initial conditions on the interval [-4, 4], you get the results that are summarized in the needle plot to the left. (Click to enlarge.) The color of the needle at x indicates the root to which x converges under Newton's method. The height of the needle indicates the number of iterations required to converge.

You can see that initial guesses that are close to a root converge to the nearby root in five or fewer iterations. Near the critical points, Newton's method requires more iterations to converge, often more than 10 and sometimes more than 20 iterations. The regions near the critical points are interlaced bands of color, which indicates that the dynamics of Newton's method is complicated in those regions. A small change in the initial guess can result in a big difference in the Newton iterations.


The dynamics near 0 seem interesting, so let's zoom into that region. In particular, repeat the previous computation for 250 initial conditions on the interval [-0.5, 0.5]. The needle plot to the left reveals additional details. All roots can be reached from initial guesses in this region, even though the nearest roots are at x = -1 and x = 1 (roots #2 and #3). Again, there are regions for which many iterations are required and there is an interlacing of colors, which indicates that Newton's method is sensitive to the initial guess.

You can zoom again into a multicolored region and repeat the computations on a new interval. The behavior continues ad infinitum. You can find two initial guesses that differ by an arbitrarily small amount, yet their iterations under Newton's method eventually diverge from each other and converge to different roots. This is known as sensitive dependence on initial conditions or, more poetically, as the butterfly effect.


Newton's method is one of my favorite root-finding techniques. However, it is important to understand that the famous quadratic convergence of Newton's method applies to initial guesses that are close to a root. For an arbitrary initial guess, Newton's method can result in divergence, periodic orbits, or convergence to a far-away root. Fractals and chaos are fun topics to explore, but not when you need to find a root as part of your work. Therefore I recommend using a pre-search method, as described in my previous article, to obtain a good initial guess.



Finding roots: Automating the search for an initial guess

A SAS programmer asked an interesting question on a SAS Support Community. The programmer had a nonlinear function with 12 parameters. He also had a file that contained 4,000 lines, where each line contained values for the 12 parameters. In other words, the file specified 4,000 different functions. The programmer wanted to know how to use SAS to find at least one root for each of the 4,000 functions.

Different kinds of root-finding algorithms

Root-finding algorithms fall into two general classes: "shooting methods" and "bounding methods." Shooting methods include the secant algorithm and Newton's method. These iterative methods use derivative information to try to predict the location of a root from a guess. These methods are indispensable for multivariate functions, and you can read about how to implement Newton's method in SAS. Unfortunately, shooting methods might not converge or might converge to a root that is far away from the root that you wanted to find.

For univariate functions, bounding methods are simpler. Bounding methods include the bisection method and Brent's method. The advantage of bounding methods is that they are guaranteed to find a root inside a bounding interval. A bounding interval is an interval [a,b] for which f(a) and f(b) have different signs. If f is a continuous function, there must be a root in the interval.

How to approximately find a root?

For both classes of methods, it is important to determine an approximate location for the root. If you have a function of one variable, you can graph the function on a wide domain such as [-20,20] or [-100, 100] to approximate the location of roots. If necessary, redraw the graph on a smaller domain to "zoom in" on the details.

For multivariate functions, use the tips in the article "How to find an initial guess for an optimization," and apply them to the squared norm of the function, ||f||^2, which is the sum of squares of the components of the function. When ||f||^2 is small, there might be a root nearby.

How to automate the search for a root?

Graphing a univariate function usually means evaluating the function at many points in the domain and plotting the ordered pairs (x, f(x)). Graphing enables you to see the approximate x values for which f changes signs.

However, you don't actually need to create the graph to use this technique. You could evaluate the function on a uniform subdivision in the domain and then use a program to detect intervals where the function changes signs. I call this a "pre-search" technique, because it isn't actually solving for roots, it is merely finding regions that contain roots.

To find regions that contain roots, choose a large interval [a,b] and subdivide that interval into n smaller subintervals, each of size (b-a)/n. Evaluate the function at each point of the subdivision and locate the intervals for which the function changes sign. You can use the DIF function and the SIGN function to locate sign changes.


Let's see how this technique works in practice. The following SAS/IML function evaluates a fifth-degree polynomial by using Horner's method. The graph of the polynomial (shown to the left) reveals that it has roots near the values {-3, -1, 1, 2.5, 3}. The MeshEval function evaluates the polynomial at evenly spaced points and returns the subintervals on which the polynomial changes signs:

proc iml;
/* Evaluate fifth-degree polynomial function. Make sure this function can 
   evaluate a vector of x values. 
   To generalize, write:  start Func(x) global(p);   */
start Func(x);
   /* c5*x##5 + c4*x##4 + ... + c0 */
   p = {1 -2 -10 20 9 -14};       /* Coefficients: p[1]=c5, p[2]=c4, etc */
   y = j(nrow(x), ncol(x), p[1]); /* initialize to coef of largest deg */
   do j = 2 to ncol(p);           /* Horner's method of evaluation */
      y = y # x + p[j];
   end;
   return( y );
finish;
/* Evaluate function 'Func' and return intervals where function
   changes sign. Specify a large interval [a,b] and the number, n, 
   of equally spaced subdivisions of [a,b].    */
start MeshEval(ab, n);
   x = ab[1] + range(ab)/n * (1:n);         /* n values of x */
   y = Func(x);                             
   difSgn = dif(sign(y));                   
   idx = loc(difSgn=2 | difSgn=-2);
   if IsEmpty(idx) then return ( {. .} );
   return( x[idx-1] || x[idx] );
finish;
ab = {-10 10};                /* search for roots on this big domain */
n = 100;                      /* break up into n smaller intervals */
intervals = MeshEval(ab,n);   /* intervals on which the function changes sign */
F = Func(intervals);
print intervals[c={"a" "b"} L=""]   F[c={"F(a)" "F(b)"} L=""];

The table shows five bounding intervals on which the function changes signs. Because the function is continuous, there are roots in the intervals [-3.2, -3], [-1, -0.8], [0.6, 0.8], [2.2, 2.4], and [2.8, 3]. You can use the FROOT function to find the roots in each bounding interval:

roots = froot("Func", intervals);
print roots;

The FROOT function found all five roots, one root per bounding interval.

How to find the roots of 4,000 functions?

Returning now to the problem asked in the SAS Support Community, the infrastructure is in place to solve for the roots of an arbitrary number of polynomials. All you need to do is add a GLOBAL clause to the START statement and pass in p as a global variable that contains the coefficients. For each set of coefficients in the file, assign p and call the MeshEval and FROOT functions.

In fact, the Func function is written so that it can evaluate a polynomial of any degree, so the same function will work regardless of the number of elements in the p vector.
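A sketch of the driver loop might look like the following. It assumes that Func has been redefined with `start Func(x) global(p);` and that the MeshEval module from earlier is defined; the coefficient matrix here is a made-up stand-in for the 4,000-line file:

```sas
coef = {1 -2 -10 20  9 -14,
        1  0  -5  0  4   0};            /* hypothetical coefficient sets */
do i = 1 to nrow(coef);
   p = coef[i, ];                        /* set the global coefficients  */
   intervals = MeshEval({-10 10}, 100);  /* bounding intervals for this p */
   if intervals[1] ^= . then do;
      roots = froot("Func", intervals);  /* one root per bounding interval */
      print i (roots`);
   end;
end;
```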

This approach will always succeed for odd-degree polynomials because you can always find a bounding interval. However, it might not find all roots, and this approach can fail for an arbitrary function. For example, the quadratic polynomial f(x) = x^2 - ε^2 has roots at ±ε. When ε is small, the MeshEval function might not find a bounding interval unless n is very large. If you are sure that the function has a real root, but MeshEval does not find it, you can iteratively increase the value of n and call MeshEval until the root is found.


Everything you wanted to know about writing SAS/IML modules

One of the fundamental principles of computer programming is to break a task into smaller subtasks and to modularize the program by encapsulating each subtask into its own function. I have written many blog posts over the years about how to define and use functions in the SAS/IML language. I thought it would be useful to compile that information in a single place so that SAS/IML programmers can find it easily.

I will periodically add material to this post, so that it remains current.

A simple example

The purpose of SAS/IML software is to enable you to extend the capabilities of SAS software. Creating user-defined modules is an important part of that purpose. Suppose that you want to define a new function that squares each element of a matrix. The following statements define the function Sqr and call the function with a vector parameter:

proc iml;
/* square each element of a matrix */
start Sqr(x);
   return( x#x );  /* elementwise multiplication */
finish;
x = 1:5;           /* x = {1 2 3 4 5} */
y = Sqr(x);        /* square each element: y = {1 4 9 16 25} */

How to define and use SAS/IML function modules

The following list contains fundamental information about creating SAS/IML functions.

  • It's easy to define a new function in SAS/IML software. Use the START statement to specify the function name and arguments, write the body of the function, and use the FINISH statement to signal that the definition is complete.
  • The parameters to a SAS/IML function are passed by reference, which means that their values can be modified from within the body of the function.
  • There are two kinds of modules in the SAS/IML language: functions and subroutines. Functions use the RETURN statement to return a value. Subroutines do not return a value. They typically make modifications to the arguments (for example, sorting a vector), although you can also use subroutines to return multiple arguments from a module.
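For example, the following sketch (my own illustration) defines a subroutine that modifies its argument in place, which is possible because arguments are passed by reference:

```sas
proc iml;
start Standardize(x);              /* subroutine: no RETURN statement */
   x = (x - mean(x)) / std(x);     /* overwrites the caller's vector */
finish;

y = {1, 2, 3, 4, 5};
run Standardize(y);                /* y is modified by the call */
print y;                           /* y now has mean 0 and std 1 */
```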

How to save, reuse, and share SAS/IML functions

An advantage of writing a module is that you can call the same function from several programs.
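For example, you can use the STORE statement to save a compiled module and the LOAD statement to retrieve it in a later program. A minimal sketch:

```sas
proc iml;
start Sqr(x);
   return( x#x );
finish;
store module=Sqr;    /* save the module to the default storage library */
quit;

proc iml;
load module=Sqr;     /* retrieve the module in a new PROC IML session */
y = Sqr(1:3);        /* y = {1 4 9} */
print y;
```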

Optional parameters and default values

SAS/IML 12.1 and 12.3 introduced optional parameters and default values for parameters.
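For example, the following sketch (my own illustration) gives the second parameter a default value, so the caller can omit it:

```sas
proc iml;
start ScaleVec(x, scale=1);     /* scale is optional; default is 1 */
   return( scale # x );
finish;
y1 = ScaleVec(1:3);             /* uses the default: {1 2 3}   */
y2 = ScaleVec(1:3, 10);         /* overrides it:     {10 20 30} */
print y1, y2;
```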

Scope of functions and variables

The following list provides technical details about how modules work.

Debugging tips


Execute SAS/IML statements that are in a file at run time

A feature of SAS/IML 13.2 (shipped with SAS 9.4m2, Aug 2014) is the ability to execute SAS/IML statements that are in a file. The feature is implemented by the new EXECUTEFILE subroutine.

This feature is similar to the CALL EXECUTE statement. The difference is that the EXECUTEFILE subroutine reads, parses, and executes statements from a file, rather than from a character string.

The EXECUTEFILE subroutine is also similar to the %INCLUDE statement, which is used by many SAS programmers to insert the contents of a file into a program. You can think of the EXECUTEFILE subroutine as a run-time equivalent of the %INCLUDE statement, which runs at parse time. For an explanation of parse time and run time, see "How to find and fix programming errors."

How does EXECUTEFILE differ from %INCLUDE?

The EXECUTEFILE subroutine differs from the %INCLUDE statement in two ways. First, the %INCLUDE statement unconditionally tries to read the file. Compare the results of the following IF-THEN statements:

proc iml;
if x>0 then
   %include "";         /* ERROR if file does not exist */
if x>0 then 
   call executefile(""); /* 13.2: file conditionally executed */

The first IF-THEN statement returns an error if the file does not exist ("ERROR: Cannot open %INCLUDE file"). That is because the SAS pre-processor tries to insert the contents of the file when the IF-THEN statement is parsed. The pre-processor does not know the value of x. In contrast, the second IF-THEN statement does not return an error. The IF condition is not true, so the CALL EXECUTEFILE statement is never run. Therefore it does not matter whether the file exists.

The second way that the %INCLUDE and CALL EXECUTEFILE statements differ is when called inside a loop. If you use a %INCLUDE statement inside a loop, the code is inserted and parsed once. All macro values are resolved at parse time. In contrast, the EXECUTEFILE subroutine will load, parse, and execute the file during each iteration of the loop. This means that the contents of the file can change between iterations and that macro variables are resolved for each iteration.

How to use the EXECUTEFILE subroutine

To use the EXECUTEFILE subroutine, you can use your favorite text editor to create a plain text file in some location that SAS can read. The text file should contain valid SAS/IML statements, but should not include the PROC IML statement. Since SAS can write text files, this example uses the DATA step to create a text file that has two SAS/IML statements:

filename ExeFile "C:/temp/";
data _null_;
   file ExeFile;
   time = putn(datetime(), "DATETIME17.");
   cmd = catt('print "The file was created on ', time, '";');
   put cmd;
   cmd = 'print "The current time is " (datetime())[F=DATETIME17.];';
   put cmd;
run;

The DATA step creates a text file. The actual text depends on when you run the DATA step, but it will look something like this:

print "The file was created on 11JUN15:06:11:33";
print "The current time is " (datetime())[F=DATETIME17.];

The first statement is PRINT statement that displays a literal string. The string was created by the DATA step to include the date and time at which the DATA step ran. The second statement is a PRINT statement that calls the DATETIME function and displays the time at which the statement is run.

The following SAS/IML program uses the EXECUTEFILE subroutine to run the two print statements in the file:

/* Execute statements in file in SAS/IML */
proc iml;
   run executefile("C:/temp/"); 

When you run the PROC IML program, it executes the statements in the text file. Notice for my example that the program was run about 30 minutes after the file was created.

When to use the EXECUTEFILE subroutine

In 99% of the use cases, the classic %INCLUDE statement is sufficient to include statements into a SAS/IML program. The EXECUTEFILE subroutine becomes necessary when the statements refer to global statements or macro variables that you do not want to evaluate until run time.

Using macro variables in SAS/IML loops can be challenging. In a previous article, I wrote about calling global SAS statements within a SAS/IML loop. The example in that article uses the CALL EXECUTE statement to set the TITLE statement with the current value of the counter for an iterative loop. You can use the EXECUTEFILE subroutine to perform the same computation:

filename ExeFile "C:/temp/";
data _null_;
   file ExeFile;
   put 'TITLE "mu = &mu";';  /* TITLE includes current value of macro variable */
run;
proc iml;
x = j(50,1);
do mu = 0 to 1;
   call randgen(x, "Normal", mu);   /* x ~ N(mu, 1) */
   call symputx("mu", mu);          /* store counter in macro variable */
   run executefile(ExeFile);        /* can use fileref in PROC IML */
   call histogram(x);
end;

The program displays two histograms. The first has the title "mu = 0", and the second has the title "mu = 1". The histograms are shown in my previous article. Notice that in PROC IML you can use a SAS fileref as the argument to the EXECUTEFILE subroutine. (That option is not supported in SAS/IML Studio.)

For most situations, you can continue to use the %INCLUDE statement to include statements into a SAS/IML program. But programmers who need to conditionally read and execute statements can use the EXECUTEFILE subroutine. By using the EXECUTEFILE subroutine, SAS does not see the code until run time.


The spiral of splatter

"Daddy, help! Help me! Come quick!"

I heard my daughter's screams from the upstairs bathroom and bounded up the stairs two at a time. Was she hurt? Bleeding? Was the toilet overflowing?

When I arrived in the doorway, she pointed at the wall and at the floor. The wall was splattered with black nail polish. On the floor lay a broken bottle in an expanding pool of black ooze.

"It slipped," she sobbed.

As a parent, I know that there are times when I should not raise my voice. I knew intellectually that this was one of those times. But staring at that wall, seeing what I was seeing, I could not prevent myself from yelling.

"Oh my goodness!" I exclaimed. "Is that a logarithmic spiral?"


There on the wall was a spiral formed from the drops of black nail polish as the bottle had spiraled out of my daughter's hands and ejected its contents. Logarithmic spirals appear in many natural phenomena, such as the spiral arms of galaxies, the bands of hurricanes, and the curve of a mollusk's shell. However, I had not expected to see such a beautiful spiral of splattered nail polish on my wall.

I took a picture of the spiral, and we began a futile attempt to clean the wall. The nail polish proved impossible to wipe from the wall, although we were able to clean the tile floor.

How do you fit a logarithmic spiral?

As I was cleaning up the mess, my mind was racing. Is this really a logarithmic spiral? Could I construct a "spiral of best fit" that passes through the splattered drops?

I remembered that I once fit a circle to data by minimizing the radial distance between the data points and a circle with center (x0, y0) and radius R. This problem seems similar, but there are two differences. In the circle problem, I used the implicit equation of a circle, (x - x0)^2 + (y - y0)^2 = R^2. The logarithmic spiral is usually given explicitly as a polar (parametric) curve with the standard form R(θ) = a exp(bθ), where a and b are unknown parameters. The second problem is that the spiral has multiple branches. For example, in the vertical direction there are several polar angles (π/2, 5π/2, and so on), each with a different radius.

To construct a "spiral of best fit," you can do the following:

  1. Choose a coordinate system and use the photo to find the (x,y) coordinates of points along the spiral curve.
  2. Create an objective function whose minimum will produce parameters for the spiral that best fits the data. This problem has four parameters: the center of the spiral (x0, y0) and the two parameters a and b, which are the amplitude and decay rate of the logarithmic spiral.
  3. Fit the curve to the data and see if the splatter marks on the wall appear to be modeled by the curve.

Finding points along the curve


As punishment for my daughter's carelessness, I told her that she had to help me input data from the photograph. She claimed that this punishment was "cruel and unusual," but she did it anyway.

I printed out the photo and used a ruler to draw a grid of lines 1 centimeter apart across the page. I chose the origin to be a big blob of nail polish at the end of the spiral, about 1/3 of the way up from the bottom of the image. I then marked points at 5 mm intervals along the curve and read out the coordinates (to the nearest 1 mm) as my daughter typed them into SAS. The resulting coordinates are shown in the scatter plot. A large portion of the nail polish landed on the door-frame molding, which is raised 1.5 cm above the wall. That explains the apparent discontinuity in the upper left portion of the graph.

Create an objective function

To find the logarithmic spiral that best fits the data, you can minimize a function that computes the sum of the squared distances (in the radial direction) between the data and the predicted points along the spiral. Because the center of the spiral is unknown, you have to fit four parameters: the center of the spiral (x0, y0) and the parameters a and b.

For given values (x0, y0), you can center the (x, y) coordinates by subtracting the coordinates of the center point. In a previous blog post, I showed how to obtain an increasing sequence of polar angles from the coordinates of (centered) data that are ordered along a parametric curve. Let θi be the polar angle that corresponds to the point (xi, yi). By using the fact that Euclidean and polar coordinates are related by x = R cos(θ) and y = R sin(θ), you can write down the predicted Euclidean coordinates of the logarithmic spiral:
ui = a cos(θi) exp(b θi)
vi = a sin(θi) exp(b θi)

The objective is therefore to minimize the radial distance between the predicted and observed values:
F(x0, y0, a, b) = Σi [ (xi-ui)² + (yi-vi)² ]
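As a sketch of this objective in Python (a hypothetical helper, not the SAS/IML code used for the fit; for simplicity the unwrapped angles θi are passed in rather than recomputed from the centered data, which the full objective must do):

```python
import math

def spiral_objective(params, pts, thetas):
    """Sum of squared differences between observed points and the
    logarithmic spiral r = a*exp(b*theta) centered at (x0, y0).
    thetas holds the (unwrapped) polar angle for each observed point."""
    x0, y0, a, b = params
    total = 0.0
    for (x, y), t in zip(pts, thetas):
        u = a * math.cos(t) * math.exp(b * t)   # predicted centered coords
        v = a * math.sin(t) * math.exp(b * t)
        total += ((x - x0) - u) ** 2 + ((y - y0) - v) ** 2
    return total
```

The function is zero when the data lie exactly on the spiral with the given parameters, and it grows as the parameters move away from the best fit.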

Fit a logarithmic spiral to the data

To optimize the function I decided to use the trust region method (the NLPTR call) in SAS/IML software, just as I did when fitting a circle to data. Based on the scatter plot, I guessed that the center was approximately (x0, y0) = (1, -1), and that a ≈ 26 and b ≈ -0.5.


You can download the complete SAS program, which uses SAS/IML to optimize the parameters. The logarithmic spiral that best fits the data is shown overlaid on the scatter plot of the data. The curve fits remarkably well, with the obvious exception of the nail polish that splattered on the door frame (trim) instead of on the wall. If that section of data were moved upwards by 1–1.5 cm, then the fit would be even better. For the record, the parameter estimates are (x0, y0) = (0.69, -1.06), a = 22.34, and b = -0.35.

All models are wrong...including this one

As George Box famously said, "All models are wrong, but some are useful." However, this spiral model might be both wrong and useless.

I was so stunned by the appearance of this spiral on my wall that I didn't stop to think about whether a logarithmic spiral is a curve that can appear when a rotating bottle falls under the effects of gravity. The physics of falling objects is well understood.

Of course, I have no idea how the bottle was tumbling when it left my daughter's hands, nor do I know at what angle the bottle was moving relative to the wall. Nevertheless, you can make some simplifying assumptions (bottle falling straight down, rotating in the horizontal plane as it falls,...) and compute the path that the nail polish would trace out in that simple situation. The physics-based model (not shown) also looks similar to the pattern on my wall. It's been a long time since freshman physics, and this is not a physics blog, so I invite the interested reader to download the SAS program and read my attempt to derive a physics-based model.

The main difference between the physics-based path and the logarithmic spiral is the function that is responsible for the radial decay of the spiral. In the logarithmic spiral, the decay function is the exponential function, which is a decreasing function that never reaches zero. In the physics-based model, the decay function is related to the distance between the bottle and the wall, which is a decreasing function that reaches zero in finite time. Any decreasing function will lead to a spiral, some of which have names. For example, a linear decreasing function is an Archimedean spiral.

I think that these data can be fit by many families of spirals because the bottle rotated only about 1.25 times before it hit the wall. To distinguish between the spirals, the data should show at least two full rotations so that the decay function can be more accurately modeled. However, no matter what model you choose, you can use the same optimization ideas to fit the parameters in the model.

To me it doesn't matter whether the spiral on my wall was logarithmic, Archimedean, or something else. It was cool to see, and I got to spend a little time talking about math with my daughter, which is a special treat. And, of course, I enjoyed the statistical modeling and SAS programming that I used to analyze these data.

Now I must stop writing. I have to paint that wall!


Computing polar angles from coordinate data

Equations that involve trigonometric functions can have infinitely many solutions. For example, the solution to the equation tan(θ)=1 is θ = π/4 + kπ, where k is any integer. In order to obtain a unique solution to the equation, we define the "arc" functions: inverse trigonometric functions that return a principal value. For example, the Arctan function returns a principal value in the range (-π/2, π/2), such as Arctan(1) = π/4. In SAS software, the ATAN function computes the Arctan function.

You can extend the Arctan function in a natural way: given the coordinates (x, y) of a point in the plane, you can compute the angle formed between the x axis and the vector from the origin to (x, y). SAS provides the ATAN2 function for this computation. The ATAN2 function takes two arguments (the coordinates of an (x, y) pair) and returns the angle in the range (-π, π] that corresponds to the vector angle.

The ATAN and ATAN2 functions are supported in Base SAS, so you can call them from the DATA step or from the SAS/IML language. The following statements compute the points on the unit circle for several polar angles. The ATAN function evaluated at y/x returns the principal arctangent. The ATAN2 function evaluated at (y, x) returns the polar angle in (-π, π]. Notice that the order of the arguments for the ATAN2 function is the reverse of what you might expect!

data Arctan;
drop pi i theta;
array a[8] $6 _temporary_ 
           ("0" "pi/4" "pi/2" "3*pi/4" "pi" "5*pi/4" "3*pi/2" "7*pi/4") ;
pi = constant("pi");
do i = 0 to 7;
   Angle = a[i+1];
   theta = i*pi/4;        /* angle in [0, 2*pi) */
   x = cos(theta);        /* (x,y) on unit circle */
   y = sin(theta);
   atan = atan( y/x );    /* ATAN takes one argument */
   atan2 = atan2(y, x);   /* ATAN2 takes two arguments: Notice order! */
   output;
end;
run;

proc print; run;

Notice that the ATAN and ATAN2 functions agree for points that are in the first and fourth quadrants. Notice also that both functions return their values in radians. You can convert radians to degrees by multiplying the radian measure by 180/π.
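The same quadrant behavior is easy to check in any language that supports these functions. For example, here is a small Python sketch (Python's math.atan and math.atan2 behave like the SAS functions):

```python
import math

# ATAN(y/x) agrees with ATAN2(y, x) only when x > 0 (quadrants I and IV);
# in quadrants II and III the two results differ by exactly pi.
for theta in [math.pi/4, 3*math.pi/4, 5*math.pi/4, 7*math.pi/4]:
    x, y = math.cos(theta), math.sin(theta)
    principal = math.atan(y / x)   # always in (-pi/2, pi/2)
    polar = math.atan2(y, x)       # in (-pi, pi]; note the argument order
    agree = abs(principal - polar) < 1e-12
    assert agree == (x > 0)        # agreement exactly when x > 0
```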

Compute angles along polar curves


Suppose that you have data points such as those shown in the scatter plot at the left. The ATAN2 function enables you to compute an angle in (-π, π] for a given data point. But to reconstruct the curve, you need to compute angles outside that interval. The scatter plot shows points that lie along a logarithmic spiral as computed by the following program.

proc iml;
pi = constant("pi");
/* create (x,y) coordinates of polar curve for (golden) logarithmic spiral 
   r(theta) = a*exp(b*theta)  */   
a = 1;   b = -0.3063;
theta = do(0, 6*pi, pi/24)`;
x = a*cos(theta)#exp(b*theta);
y = a*sin(theta)#exp(b*theta);
title "Logarithmic Spiral";
call scatter(x,y) other="refline 0 / axis=y; refline 0 / axis=x;";

Recently I had data that looked like the scatter plot, and I needed to recover the angles that correspond to the points along the curve. If you naively use the ATAN2 function, all angles are translated into the interval (-π, π]. However, if the data are ordered according to the polar angle, as in these data, you can monitor the output of the ATAN2 function as you sequentially traverse the points along the curve. When one point is in the second quadrant but the next point is in the third quadrant, the ATAN2 function "jumps" by -2π across those two points. You can detect the discontinuity and add 2π to the value of the ATAN2 function evaluated at all subsequent points along the curve. If you do this every time that the curve crosses the negative x axis, the resulting set of angles will correctly represent the polar angle along the curve.

This algorithm is implemented in the following SAS/IML module, which assumes that the polar angle increases along the curve. With slightly more complicated logic, you can derive the polar angle for points along any parametric curve in the plane.

/* assume the polar angle is an increasing function along the curve */
start IncrAngle(x, y);
   theta = atan2(y,x);            /* angle from center to point */
   dif = dif(theta);              /* difference of angles between adjacent pts */
   idx = loc(dif^=. & dif<0);     /* for which pts does angle decrease? */
   n = nrow(dif);
   pi = constant("pi");
   do i = 1 to ncol(idx);
      theta[idx[i]:n] = 2*pi + theta[idx[i]:n]; /* add 2*pi at discontinuities */
   end;
   return( theta );
finish;

phi = IncrAngle(x,y);         /* test function for pts along logarithmic curve */

The variable phi is a vector of increasing values. It equals the theta vector, which contains the original sequence of angles in [0, 6π].
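For readers who want to experiment outside of SAS, here is a sketch of the same algorithm in Python (hypothetical function name; like the SAS/IML module, it assumes that the polar angle increases along the curve):

```python
import math

def incr_angle(xs, ys):
    """Recover increasing polar angles for points ordered along a curve.

    atan2 returns values in (-pi, pi]. Whenever the raw angle of the next
    point is smaller than that of the previous point, the curve has crossed
    the negative x axis, so add another 2*pi to all subsequent angles.
    """
    raw = [math.atan2(y, x) for x, y in zip(xs, ys)]
    offset = 0.0
    angles = [raw[0]]
    for prev, cur in zip(raw, raw[1:]):
        if cur < prev:                 # jump discontinuity of -2*pi
            offset += 2 * math.pi
        angles.append(cur + offset)
    return angles
```

Applied to points sampled along a spiral for θ in [0, 3π], the function returns the original angles rather than values folded into (-π, π].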


The ATAN function computes the principal arctangent of its argument. For a point (x, y) in the plane, the ATAN2 function enables you to compute the angle formed between the x axis and the vector from the origin to (x, y). The ATAN2 function returns an angle in the range (-π, π] and has a jump discontinuity of 2π across the negative x axis. Using these facts, you can construct a continuous function that returns the polar angle for points along a parametric curve in the plane.

In my next blog post I will use this function to analyze data that seems to lie along a parametric curve in the plane.


Fit a circle to data


I still remember the first time I was asked to "consult" on a statistical problem. A former physics professor had some students who had gathered data that should lie along an arc of a theoretical circle. The professor asked if there was a regression technique that could find the center and radius of the circle that best fits the data.

After thinking about the problem, I sent him a method that the students could use to fit a circle to their data. The method differs from an ordinary regression in one important way. In the familiar least squares regression, you minimize the sum of the squares of the vertical distance between the observed Y values and the predicted Y values, where the prediction is a function of observed X values. For fitting a circle to data, you want to minimize the sum of the squares of the radial distances.

I have long since lost the sheet of paper that contained the real data, but the following made-up data is similar to what I remember. The scatter plot at the top of this article shows that the points seem to lie along the arc of a circle.

data Circle;
input x y  @@;
datalines;
2.1   0     1.3  0.1    0.95 0.2    1.7    0      0.1  0.7   0.4  0.45
1.3   0.1   0.6  0.35   1.95 0      2.65   0.05   2.7  0.1   2.95 0.15
1.2   0.1   3.5  0.4    1.5  0.05   1.95   0      3.3  0.3   3.45 0.35
2.3   0     2.15 0      2.05 0      3.15   0.2    1.9  0     3.15 0.25
1.95  0     2.3  0      3.3  0.3    0.05   0.7    2.2  0     3    0.15
;
run;

ods graphics / width=600px height=150px;
title "Estimate Center and Radius of Circle That Fits Data";
proc sgplot data=Circle;
   scatter x=x y=y;
   xaxis grid values=(0 to 4 by 0.5);
   yaxis grid values=(0 to 1 by 0.1);
run;

The goal of this article is to show how to find the center and radius of the circle that best fits the data by using an optimization approach. I do not claim to have invented this method, and I'm sure that there are other ways to do this computation.

Defining the objective function

You can define the "circle of best fit" as the circle that minimizes the radial distances between the data and the circle. Consider a circle with radius R and center (x0, y0). If {(x1, y1), ..., (xn, yn)} are the observed ordered pairs, then the goal is to minimize the following function, which computes the sum of the squared differences between data and the circle:
F(x0, y0, R) = Σi [ (xi-x0)² + (yi-y0)² - R² ]²

You could also insert a square-root sign and compute the difference between R and the distance to the center. My formulation leads to simpler computations of partial derivatives, which I leave as an exercise.

Obviously this function reaches a minimum when the partial derivatives with respect to the parameters vanish. If you set the partial derivative ∂F/∂R equal to zero, you discover that the best-fit circle has a squared radius that is the mean of the squared distances from the data to the center of the circle. In symbols, the optimal R value satisfies
R² = (1/n) Σi [ (xi-x0)² + (yi-y0)² ].
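You can confirm this reduction numerically. The following Python sketch (made-up random points, not the Circle data) checks that, for a fixed center, the sum of squares is smallest when R² equals the mean squared distance from the data to the center:

```python
import random

random.seed(1)
pts = [(random.uniform(0, 4), random.uniform(0, 1)) for _ in range(30)]
x0, y0 = 2.0, 3.0                        # any fixed candidate center
d2 = [(x - x0) ** 2 + (y - y0) ** 2 for x, y in pts]

def F_of_R2(R2):
    """Objective as a function of R**2 with the center held fixed."""
    return sum((d - R2) ** 2 for d in d2)

R2_opt = sum(d2) / len(d2)               # claimed minimizer: mean of d2
# F is an upward-opening quadratic in R**2, so it must rise on both sides
assert F_of_R2(R2_opt) < F_of_R2(R2_opt + 0.5)
assert F_of_R2(R2_opt) < F_of_R2(R2_opt - 0.5)
```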

Thus the problem reduces to a minimization over two parameters. You can use the nonlinear optimization methods in SAS/IML to solve this minimization problem for a given set of data. The following statements define the objective function. (Notice that the GLOBAL statement is used to access the data.) Based on the scatter plot of the data, I provided an initial guess of (x0, y0) = (2.5, 5). In previous blog posts, I have used Newton-Raphson optimization (the NLPNRA call) for the optimization, so for a change of pace I will use a trust region method (the NLPTR call) to solve this problem.

proc iml;
start ObjF(p)  global(x, y);    /* x and y are fixed vectors of data */
   x0 = p[1];  y0 = p[2];       /* parameters to optimize */
   d2 = (x-x0)##2 + (y-y0)##2;  /* vector of distances from pts to center */
   R2 = mean(d2);               /* square of optimal radius */
   F = ssq(d2 - R2);            /* minimize F = SSQ differences */
   return( F );
finish;
use Circle;  read all var {x y};  close Circle;  /* read observed data */
p = {2.5 5};                           /* initial guess for (x0, y0) */
opt = {0 1};                           /* options=(min, print level) */
call nlptr(rc,result,"ObjF", p, opt);  /* find optimum (x0, y0) */
x0 = result[1]; y0 = result[2];
R2 = mean( (x-x0)##2 + (y-y0)##2 );    /* compute optimum R##2 */
ParamEst = x0 || y0 || sqrt(R2);
print ParamEst[c={"x0" "y0" "R"} label="Parameter Estimates"];

For these data, the circle of best fit has center (2, 3) and radius 3. The following scatter plot overlays the optimal circle on the data:


As with many nonlinear optimization problems, it is important to use a good starting guess. For the data in this example, the solution seems to be robust to poor initial guesses. However, for some distributions of the data, there might be multiple local optima.
