Box & Lucas: Designed experiments for nonlinear models

All this month, I'm writing about George E.P. Box, as part of the celebration of the International Year of Statistics. Last week, I wrote about Box-Behnken designs for fitting response surface models. In this post, I want to tell you about the paper Box wrote in 1959 with H. L. “Curly” Lucas.

I was interested in this paper for several reasons. First, Lucas is the father of one of my colleagues at SAS, Bob Lucas. I was just talking to Bob about this blog post, and he told me that his father was called Curly not because he had curls but because he looked like Curly from the Three Stooges. I was also interested in this paper because it is one of the first papers on how to design experiments for models that are nonlinear in the parameters.

How does the paper begin?

The paper starts with an example of the kind of problem the authors want to address. They imagine a chemical reaction where a substance A decomposes to form substance B, which in turn decomposes to form substance C. The yield, y, of B as a function of time, t, is
$$y = \frac{\theta_1}{\theta_1 - \theta_2}\left\{\exp(-\theta_2 t) - \exp(-\theta_1 t)\right\}$$

The problem is to choose reaction times at which to observe the yield in order to get the most precise estimates possible of the unknown parameters, $\theta_1$ and $\theta_2$.

What happens next?

A short subsection then deals with the practicalities of choosing ranges for the factors. For example, in the reaction example, the time must be positive. The following section contains the main analytical development. Here, Box and Lucas introduce what amounts to a local D-optimality criterion for the nonlinear model. They point out that the information in the design depends on the unknown values of the parameters, so you need an initial guess at the parameters to get started.

They also suggest two ideas for further work that seem prescient. One idea was to consider a prior distribution on the unknown parameters. This naturally leads to a Bayesian D-optimality criterion, which is what the JMP nonlinear designer uses.

Their second idea was to consider a sequential approach whereby the researcher augments the current data by adding points that are optimal with respect to the current parameter estimates. Then, new responses are acquired, and the model is updated. Assuming the fitted parameters change, there would be another set of optimal runs to do. This is current practice in some dose response modeling studies.

At this point, they return to their motivating example, guessing that $\theta_1 = 0.7$ and $\theta_2 = 0.2$. Given these parameter values, they plot the yield as a function of time.

Figure 1 shows the JMP Graph Builder plot of their function. The two red plus signs are at the optimal two times for getting precise estimates of the parameters given that the initial guesses are close to being correct.

Figure 1: Plot of yield as a function of time.

Figure 2 shows the way to enter a function for use with the nonlinear designer in JMP.

Figure 2: JMP Formula editor view for entering nonlinear models.

Table 1 shows the locally optimal design produced by JMP using initial parameter values of 0.7 and 0.2.

Table 1: Locally optimal time points for measuring the yield found using the nonlinear designer in JMP.

In the paper, Box and Lucas report the two optimal times as 1.23 and 6.86. Pretty close for 1959!
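The local D-optimality calculation for this example can be reproduced outside JMP. What follows is a minimal Python sketch, not the algorithm JMP or Box and Lucas used: it assumes the yield model with decaying exponentials, approximates the partial derivatives with respect to $\theta_1$ and $\theta_2$ by central differences, and does a brute-force grid search over pairs of times. For a two-run design with two parameters, maximizing the determinant of the information matrix is the same as maximizing the absolute determinant of the 2x2 matrix of derivatives. The function names, the grid step of 0.01 and the upper time limit of 12 are illustrative assumptions.

```python
import math


def yield_b(t, theta1, theta2):
    """Yield of B at time t for the consecutive reaction A -> B -> C."""
    return theta1 / (theta1 - theta2) * (math.exp(-theta2 * t) - math.exp(-theta1 * t))


def sensitivities(t, theta1, theta2, h=1e-6):
    """Central-difference derivatives of the yield w.r.t. theta1 and theta2."""
    d1 = (yield_b(t, theta1 + h, theta2) - yield_b(t, theta1 - h, theta2)) / (2 * h)
    d2 = (yield_b(t, theta1, theta2 + h) - yield_b(t, theta1, theta2 - h)) / (2 * h)
    return d1, d2


def locally_d_optimal_pair(theta1, theta2, t_max=12.0, step=0.01):
    """Grid search for the two times that maximize |det F| at the guessed parameters."""
    n = int(round(t_max / step))
    times = [step * (i + 1) for i in range(n)]
    sens = [sensitivities(t, theta1, theta2) for t in times]
    best_det, best_t1, best_t2 = 0.0, None, None
    for i in range(n):
        a1, a2 = sens[i]
        for j in range(i + 1, n):
            b1, b2 = sens[j]
            det = abs(a1 * b2 - a2 * b1)  # |det| of the 2x2 derivative matrix
            if det > best_det:
                best_det, best_t1, best_t2 = det, times[i], times[j]
    return best_det, best_t1, best_t2


# Initial guesses taken from the paper's example.
det, t1, t2 = locally_d_optimal_pair(0.7, 0.2)
print(round(t1, 2), round(t2, 2), round(det, 3))
```

With the guesses 0.7 and 0.2, the search should land close to the two times reported in the paper. Changing the guesses moves the optimal times, which is the dependence on unknown parameters that the authors point out.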

How does the paper end?
The rest of the paper deals with applying their methods to other examples with one or two unknown parameters. In their discussion at the end of the article, they make one more really interesting point. Sometimes, the parameters enter the model in such a way that even the best design is incapable of producing precise parameter estimates. They point out that in many cases, the predicted responses are still good even if the parameters are not nailed down due to high correlations among them. From there, it would only take a small step to suggest an optimality criterion that minimizes the average variance of the responses.

G. E. P. Box and H. L. Lucas (1959) “Design of Experiments in Non-Linear Situations” Biometrika Vol. 46 No. 1 pp. 77-90.

JMP data modeling challenge launched by ENBIS

ENBIS, the European Network for Business and Industrial Statistics, has launched the next JMP Challenge sponsored by JMP. You have until June 15 to submit your solution showing how to turn data into practical value.

The ENBIS Challenge by JMP is a yearly activity aimed at promoting sound practices of statistical modeling. In the past, participants competed against one another by submitting their solutions to a predefined problem using given data. This year's challenge is different: You can submit a solution to any data modeling problem using your own data. Everybody can participate, whether you are from industry or academia, regardless of where you live and whether you are an ENBIS member or not.

Some rules apply, though:

  • Consider all aspects of the model-building process and focus on real-life business problems.
  • Be creative and smart. Use an innovative, simple but comprehensive approach.
  • No preset topic. No fixed submission format. Use any software tool.

The winners of up to three prizes will be announced during the ENBIS-13 conference in Ankara, Sept. 15-19. One special prize may be given for a creative solution using JMP tools.

Good luck!

Making change happen in teaching statistics

Educators from around the world will gather this week at the Embassy Suites Hotel and Conference Center in Research Triangle, NC, to attend this year’s US Conference on Teaching Statistics (USCOTS), May 16-18. They will hear from thought leaders and practitioners such as Xiao-Li Meng, Dean of Harvard University's Graduate School of Arts and Sciences, and Chris Wild, Professor of Statistics at the University of Auckland, New Zealand, on the future of teaching statistics.

SAS, including JMP, is actively involved with USCOTS 2013. Besides attending and exhibiting, JMP will be hosting a banquet on the campus to show support for statistics and the people who teach it.

Curt Hinrichs, JMP Academic Marketing Manager, said: “Well over 90 percent of students taking any statistics course in the United States take an introductory service course that satisfies a graduation requirement. These are students in a variety of academic majors who may think stats is just another math course they have to take. Changing these views and making this experience a positive and compelling one are important to the long-term acceptance of data-driven problem solving. USCOTS is the premier undergraduate statistics education event here in the United States that is dedicated to promoting pedagogical innovations that improve student engagement and learning of statistics concepts. SAS and JMP are proud to be a major supporter of this event and these goals.”

Recently, I had a chance to ask Allan Rossman, Program Chair and Professor of Statistics at California Polytechnic State University, what USCOTS is and why it's important to the teaching of statistics.

What is USCOTS all about?

USCOTS is about bringing together people who teach statistics, so they can exchange ideas about how to do even better at that important and challenging task. We focus on teaching statistics at the undergraduate level, including AP Statistics that is taught in high schools. We also consider the important issues of preparing teachers of statistics at K-12 levels and conducting educational research into how students learn statistics.

We aim to model good teaching in all aspects of the conference. We have four plenary presentations and many breakout sessions that are designed to be interactive and to provide participants with take-home materials for use with their students. We also have poster sessions that enable teachers to show and have conversations about new ideas and best practices in teaching statistics.

Who is attending?

We're expecting more than 400 attendees, statistics teachers and education researchers from across the country and some from other parts of the world. Most of the participants are college and university faculty, also including teachers from two-year colleges and high schools.

Why is it important to teach statistics?

Understanding basic ideas of data and chance is essential to leading a well-informed life in today's information society, and future professionals in a wide variety of disciplines need to learn how to collect and analyze and draw conclusions from data. Statistics has come to be viewed as a popular field recently, thanks to businesses such as Google and amazon.com and to individuals such as Nate Silver who achieve remarkable success by gaining insights from data. We who teach statistics want to convey to our students how worthwhile and interesting our field is.

What will attendees come away with from the event?

We hope that attendees will leave the conference having had stimulating conversations about how to teach statistics well. They'll come away with specific materials to use in class and with concrete suggestions for changing their courses and curricula. We trust that participants will also emerge with thought-provoking ideas to occupy their minds for days and weeks and months to come. We also hope that people will come away with a network of colleagues and friends with whom they can continue to converse about teaching statistics.

Where is the future for USCOTS and the teaching of statistics?

The theme of this year's USCOTS is "Making Change Happen." This is an exciting time and perhaps a very changeable time for education in general and for the teaching of statistics in particular. The "big data" phenomenon may well change how and what we teach in undergraduate statistics, and evolving software and computing capabilities are changing what and how we teach, and revised curricula at the K-12 level provide great opportunities for change at the undergraduate level. I'm not sure where the teaching of statistics is heading, but I'm looking forward to hearing many perspectives on this important question at USCOTS.

Celebrating George Box and Box-Behnken designs

As part of the International Year of Statistics, the JMP Blog is honoring influential statisticians each month. Professor George E.P. Box is the honoree for May. Last week, I wrote about the first part of his two-part paper with J. Stuart Hunter on the family of regular two-level fractional factorial designs, which was published in Technometrics in 1961.

In this post, I focus on the famous Box-Behnken designs, which are very popular designs for fitting quadratic response surfaces. Box-Behnken designs are notable in that each factor is restricted to three levels – just enough to allow for fitting a quadratic term in each factor. Another notable feature of these designs is that each run other than the center runs has at least one factor set to zero in scaled units. This means that there are no runs where every factor is at one of its extreme values. This is in stark contrast to the regular two-level fractional factorial designs where every run has every factor at either -1 or +1 in scaled units.

How does the paper begin?

The paper starts by pointing out that quantitative factors could be set to a theoretically infinite number of values. Though they admit that there is no “essential need to restrict” to a few levels, they argue that convenience requires the use of just a few levels.

They go on to introduce the concept of a “redundancy factor,” which is the factor by which the number of runs in a design exceeds the number of parameters in the model of interest. They point out that the number of parameters in a polynomial of degree d in k factors is (k+d)!/(k!d!). In general, a full-factorial design has substantially more runs than necessary to fit the required number of terms. They point out that using the three-level full factorial design for five factors for fitting a full quadratic model requires 243 runs. But this model only has 21 unknown parameters to estimate. So, the full-factorial design has more than 11 times as many runs as are needed. They conclude: “In situations in which the experimental error variance is not so large as to require large numbers of observations to obtain necessary precision, designs having small redundancy factors are desirable.” I could not agree more!
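The arithmetic behind the five-factor example is easy to check. The short Python sketch below evaluates (k+d)!/(k!d!) as a binomial coefficient and, as an assumption about how the 243-versus-21 comparison is meant, treats the redundancy factor as the ratio of runs to parameters. The function names are illustrative.

```python
from math import comb


def n_polynomial_terms(k, d):
    """Number of coefficients in a full polynomial of degree d in k factors."""
    return comb(k + d, d)  # equals (k+d)! / (k! d!)


def redundancy_factor(n_runs, k, d):
    """Runs divided by parameters (assumed definition, see the lead-in)."""
    return n_runs / n_polynomial_terms(k, d)


k, d = 5, 2
full_factorial_runs = 3 ** k
print(n_polynomial_terms(k, d))                               # 21
print(full_factorial_runs)                                    # 243
print(round(redundancy_factor(full_factorial_runs, k, d), 2)) # 11.57
```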

The ability to keep the redundancy factor small while providing a design that allows for fitting a full quadratic model provides the motivation for their new class of designs.

What happens next?

Having motivated the need for a new family of designs having three levels for each factor and capable of fitting a full quadratic model, they turn to the clever design construction idea that results in their new class of designs. Their idea was to combine two-level factorial designs with balanced incomplete block designs “in a particular manner.”

A Box-Behnken design has groups of runs where for each run in the group only a certain number of factors change. For this group of runs, all the other factors are set at zero in scaled units. The identity of the factors that vary in each group of runs changes from one group to the next. For example, in the first group of runs, x1 and x2 might be varied, and in the next group of runs, x3 and x4 might be the variable factors.

The complete pattern of these changes is described by a balanced incomplete block design. These designs have the property that every treatment occurs the same number of times and every treatment occurs in a block with every other treatment the same number of times. In the Box-Behnken design, the “treatments” are the factors, each block corresponds to a group of runs, and the block size is the number of factors allowed to vary in a group.

Can you show an example?

Figure 1 shows a Box-Behnken design for four factors that appears as Table 6 in their paper. The first column is the Block column, which shows how to block the Box-Behnken design into three orthogonal blocks. Each block contains two groups of four runs plus a center run.

Figure 1: Box-Behnken design for 4 factors with response values

Note that in the first group, x1 and x2 vary. In the second group, x3 and x4 vary. In the third group (second block), x1 and x4 vary. In the fourth group, x2 and x3 vary. In the fifth group, x2 and x4 vary, and in the last group, x1 and x3 vary. Each factor varies in three groups, and each factor varies once in combination with every other factor. This pattern of the varying factors matches a balanced incomplete block plan with four treatments (the factor identities) and six blocks (the groups of runs), where there are two runs per block. The two “runs” are the two factors that vary in the groups.

The groups of runs are a 2x2 full factorial in the two factors that are varying.
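The construction just described can be sketched in a few lines of Python. This is an illustration rather than a reproduction of Table 6: the order of the groups follows the description above, but the run order within each group and the position of the center run in each block are assumptions, as are the names `GROUP_PAIRS` and `box_behnken_4`.

```python
from itertools import product

# Factor pairs that vary in each group, in the order described in the post
# (0-based indices: 0 is x1, 1 is x2, and so on). Two groups per block.
GROUP_PAIRS = [(0, 1), (2, 3), (0, 3), (1, 2), (1, 3), (0, 2)]


def box_behnken_4():
    """Return (block, x1, x2, x3, x4) rows for the 27-run, 3-block design."""
    runs = []
    for g, (i, j) in enumerate(GROUP_PAIRS):
        block = g // 2 + 1
        # 2x2 full factorial in the two varying factors; the others stay at 0.
        for a, b in product((-1, 1), repeat=2):
            x = [0, 0, 0, 0]
            x[i], x[j] = a, b
            runs.append((block, *x))
        if g % 2 == 1:
            # After the second group of a block, add that block's center run.
            runs.append((block, 0, 0, 0, 0))
    return runs


design = box_behnken_4()
for row in design:
    print(row)
```

Every run other than the three center runs has exactly two factors away from zero, which is why no run sits at a vertex of the cube.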

How does the rest of the paper go?

After introducing the basic idea, they show how to block these designs orthogonally and how to add center runs to reduce the prediction variance in the center of the design region. They give examples of their family of designs for 3-7, 9-12 and 16 factors. The designs for more than seven factors require substantially more than 100 runs and do not appear as options in the Response Surface design options in JMP.

After introducing the designs, they also show how to analyze data generated using these designs. This includes providing the estimates of the coefficients with their accompanying standard errors. This part of the paper does not use the matrix formulation for finding the least squares estimates and their standard errors. The four-factor case shown in Figure 1 has 15 coefficients to estimate: the intercept, four linear effects, six two-factor interaction effects and four quadratic effects. The modern approach is to create the model matrix, X, having 15 columns (one for each coefficient) and as many rows as there are runs in the design. The part that computers of 1960 could not easily do was to find the inverse of the matrix X’X, which is required for computing both the coefficient estimates and their standard errors. To avoid this computational complication, they provide a table of constants for each design and a computational approach using these constants and various sums of products of the y’s and the x’s.
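To make the model matrix concrete, here is a small Python sketch that builds X for the four-factor design and forms X’X. It assumes the 27-run design with three center runs and an arbitrary column order (intercept, linear, interaction, quadratic). The response values from the paper are not reproduced, so no coefficients are estimated. The helper names are illustrative.

```python
from itertools import combinations, product

# Pairs of factors that vary together in each group of four runs (0-based).
PAIRS = [(0, 1), (2, 3), (0, 3), (1, 2), (1, 3), (0, 2)]


def bbd4_runs(n_center=3):
    """Factor settings of the four-factor Box-Behnken design."""
    runs = []
    for i, j in PAIRS:
        for a, b in product((-1, 1), repeat=2):
            x = [0, 0, 0, 0]
            x[i], x[j] = a, b
            runs.append(x)
    runs += [[0, 0, 0, 0] for _ in range(n_center)]
    return runs


def model_row(x):
    """Intercept, 4 linear, 6 two-factor interaction and 4 quadratic terms."""
    inter = [x[i] * x[j] for i, j in combinations(range(4), 2)]
    quad = [v * v for v in x]
    return [1] + list(x) + inter + quad


X = [model_row(x) for x in bbd4_runs()]
# X'X as a plain list of lists (15 x 15).
XtX = [[sum(r[a] * r[b] for r in X) for b in range(15)] for a in range(15)]

print(len(X), len(X[0]))                 # rows and columns of X
print([XtX[a][a] for a in range(15)])    # diagonal of X'X
```

The linear and interaction columns are orthogonal to everything else, so only the intercept and the four quadratic columns form a block that needs a genuine matrix inversion. That regular structure is presumably what made a table of constants a workable substitute in 1960.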

Figure 2 shows the parameter estimates and their associated standard errors for the table in Figure 1. All the coefficients match those in the paper. However, the paper does not calculate the standard error of the quadratic effects for the four-factor example correctly: it reports these standard errors as 0.66 when the correct value is 0.63.

Figure 2: Estimated coefficients and standard errors.

Have Box-Behnken designs stood the test of time?

Box-Behnken designs are still very popular. I would recommend the three- and four-factor designs with a few caveats.

It is important to remember that these designs do not run tests at the extremes of every factor. So, predictions at these points are actually extrapolations outside the region of experimentation. Box-Behnken designs are more properly thought of as designs on a sphere rather than designs on a cube. However, I suspect that many practitioners actually use these designs to predict while allowing every factor to vary between -1 and 1 in scaled units. That is, they are thinking of this design region as cubic.

Figure 3 shows the settings of an I-optimal design for 4 factors in 27 runs – the same number as are in the Box-Behnken design in Figure 1.

Figure 3: I-optimal design with 4 factors and 27 runs for fitting a full quadratic model.

 

Note that four of the runs have all four factors at their extreme settings. There are also five center runs. Only two of the runs (other than the center runs) have more than one factor at the 0 value in scaled units.

Figure 4 shows the Fraction of the Design Space Plot for the Box-Behnken design. The red curve shows the relative prediction variance for the Box-Behnken design, and the blue curve shows the relative prediction variance for the I-optimal design.

Figure 4: Fraction of Design Space plot comparing I-optimal and Box-Behnken design predictions over the entire cube.

The maximum prediction variance for the Box-Behnken design is $2.3333\sigma^2$, and each vertex of the factor space has this variance. For the I-optimal design, the maximum prediction variance is $0.832647\sigma^2$ at the factor setting [1, 1, 1, 1] among others. The average variance of prediction over the cube for the Box-Behnken design is $0.4\sigma^2$. For the I-optimal design, it is $0.273\sigma^2$. It is pretty clear that if you want to be able to predict over the entire cube, the Box-Behnken design is being dramatically outperformed by the I-optimal design.
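The vertex value for the Box-Behnken design can be checked directly from the design. The Python sketch below computes the relative prediction variance $f(x)'(X'X)^{-1}f(x)$ for the full quadratic model at each of the 16 vertices. It assumes the 27-run design with three center runs and uses a small hand-written Gauss-Jordan solver so that nothing beyond the standard library is needed. It evaluates only the vertices and the center, not the whole cube, and it does not attempt the I-optimal design, whose settings appear only in Figure 3.

```python
from itertools import combinations, product

PAIRS = [(0, 1), (2, 3), (0, 3), (1, 2), (1, 3), (0, 2)]


def bbd4_runs(n_center=3):
    """Factor settings of the four-factor Box-Behnken design."""
    runs = []
    for i, j in PAIRS:
        for a, b in product((-1, 1), repeat=2):
            x = [0, 0, 0, 0]
            x[i], x[j] = a, b
            runs.append(x)
    return runs + [[0, 0, 0, 0] for _ in range(n_center)]


def model_row(x):
    """Full quadratic model terms: intercept, linear, interactions, squares."""
    inter = [x[i] * x[j] for i, j in combinations(range(4), 2)]
    return [1] + list(x) + inter + [v * v for v in x]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def relative_prediction_variance(x, XtX):
    """f(x)' (X'X)^{-1} f(x), in units of sigma squared."""
    f = model_row(x)
    z = solve(XtX, f)
    return sum(fi * zi for fi, zi in zip(f, z))


X = [model_row(x) for x in bbd4_runs()]
XtX = [[sum(r[a] * r[b] for r in X) for b in range(15)] for a in range(15)]

vertex_vars = [relative_prediction_variance(v, XtX) for v in product((-1, 1), repeat=4)]
center_var = relative_prediction_variance([0, 0, 0, 0], XtX)
print(round(max(vertex_vars), 4), round(min(vertex_vars), 4), round(center_var, 4))
```

Every vertex gives the same value, 7/3, which agrees with the 2.3333 quoted above.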

On the other hand, if you restrict yourself to predictions inside a sphere with a squared radius of 2, then the Box-Behnken design is very efficient. One final note is that even if you consider the region of interest for the Box-Behnken design to be cubic, this design is relatively resistant to prediction bias due to active third-order effects.

References

Box, G. E. P. and Hunter, J. S. (1961) "The $2^{k-p}$ Fractional Factorial Designs Part I" Technometrics Vol. 3, No. 3, pp. 311-351.

Box, G.E.P. & Behnken, D.W. (1960). Some new three level designs for the study of quantitative variables. Technometrics 2, pp. 455-475.

Using JMP to connect to database, insert records

I've recently gotten a few questions from customers about how to perform database operations using JMP. One question is whether you can use JMP to connect to databases. With JMP, yes, you can connect to an ODBC-compliant database using JMP Scripting Language (JSL) or the Open Database dialog interface. If you have the appropriate ODBC driver installed (and the Data Source Name or DSN has been defined), then using the File -> Database -> Open Database command will connect to the database and open the specified table within JMP. This video shows how to configure an ODBC DSN using a Microsoft Access database file.

In the example for this post, I connected to a Microsoft Access database using a machine data source name and selected the database table tblVehicle. I could also have connected to an Oracle database, a MySQL database or even a SQL Server database.

JMP Database Open Table Dialog Window

If you prefer not to use the menu option, you could also do the same using JSL. In this example script provided for demonstration purposes, I connected to the Microsoft Access database and returned all the records in the table titled "tblVehicle" (see above).

// Connect to database and select table
Open Database(
	"DSN=MS Access Database;DBQ=C:\Users\stkopr\Documents\MovinOn.accdb;DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;UID=admin;",
	"SELECT * FROM tblVehicle",
	"tblVehicle"
);

Before database record insert

Another question I've gotten recently from customers is whether you can use JMP to insert records into a database. Again, the answer is yes, you can. You can perform some analysis with JMP and then insert an analysis results table or even a single record back into the database.

The demonstration script for inserting a single record into a database table is shown below.

// Connect to database and insert new record
Open Database(
	"DSN=MS Access Database;DBQ=C:\Users\stkopr\Documents\MovinOn.accdb;DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;UID=admin;",
	"INSERT INTO tblVehicle (VehicleID, LicensePlateNum, Axle, Color) VALUES ('TRK-199', 'JMP 429', 4, 'Blue');"
);

Using the same JSL submitted previously, I can display the table in JMP with the newly inserted record selected.

JMP data table after record inserted via SQL

You could also order the records, subset the data with a where clause, submit a nested subquery or delete a single record -- all from within the comfort of JMP.

Once you have completed all your database actions, remember to close the database connection. Again, this can be done using the Disconnect button from the Open Database dialog window or using JSL.
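For readers who want to try the same sequence of SQL statements without an Access database or JMP at hand, here is a minimal Python sketch. It is a stand-in, not the JMP ODBC workflow: it uses Python's standard-library `sqlite3` module with an in-memory database, and the column types in the `CREATE TABLE` statement are assumptions, since the post only shows the column names. The inserted values are the ones from the script above.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the ODBC data source.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Column names come from the post; the types are assumed for illustration.
cur.execute(
    "CREATE TABLE tblVehicle ("
    "VehicleID TEXT PRIMARY KEY, LicensePlateNum TEXT, Axle INTEGER, Color TEXT)"
)

# Insert a single record, using placeholders instead of string concatenation.
cur.execute(
    "INSERT INTO tblVehicle (VehicleID, LicensePlateNum, Axle, Color) "
    "VALUES (?, ?, ?, ?)",
    ("TRK-199", "JMP 429", 4, "Blue"),
)
conn.commit()

# Subset with a WHERE clause and order the result.
rows = cur.execute(
    "SELECT * FROM tblVehicle WHERE Color = ? ORDER BY VehicleID", ("Blue",)
).fetchall()
print(rows)

# Delete the record again, then close the connection when done.
cur.execute("DELETE FROM tblVehicle WHERE VehicleID = ?", ("TRK-199",))
conn.commit()
remaining = cur.execute("SELECT COUNT(*) FROM tblVehicle").fetchone()[0]
conn.close()
print(remaining)
```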

More information on using the JMP ODBC interface can be found in the Help documentation under “Import Data from a Database.”

Let me know if you have used this feature to insert a record. Or do you typically write out the entire table back to the database?

Why attendees love Discovery Summit

Some have called Discovery Summit “the best conference I’ve ever attended.” While that alone is an excellent recommendation, others have even more to say.

A 2011 attendee said, “While listening to one speaker, I learned of a different methodology to analyze a particular messy study at work. That one moment of enlightenment added significant value to my employer.” And this experience is not unique.

In 2012, one attendee said, “I get more out of Discovery than any professional conference I have attended in my 16-year career.”

The combination of speakers, presentations, and networking opportunities means that, at Discovery Summit, a lot of value is packed into a few days. Attendees have plenty of chances to meet other JMP users, learn new techniques and methodologies, and gain experience with the latest features and applications of JMP.

But don’t take my word for it. Hear why attendees love Discovery Summit in this video:

 

With this year’s lineup, featuring New York Times blogger Nate Silver, statistician Dick De Veaux, and co-founder and Executive Vice President of SAS John Sall, who will offer a first look at the latest version of JMP, the comments are sure to be just as positive.

Young researchers set to take Phoenix by storm

From May 12-17, 2013, more than 1,500 high school students from over 70 countries will compete in the annual Intel International Science and Engineering Fair (ISEF) in Phoenix, Arizona. The world's largest pre-college science and engineering competition, the fair hosts the brightest and most talented young researchers in science, engineering and math.

In last year's competition, 15-year-old Jack Andraka of North County High School in Glen Burnie, Maryland, won first place, the best of the best award, for his work developing a new non-invasive detection tool for pancreatic cancer.

Congratulations to all 2013 participants. Have fun, and best of luck in this year's competition!

Visit the Intel ISEF website for more information, including past winners, categories and sponsors.

And stay tuned to hear about winners of the 2013 competition. Top teams in the Mathematics and Statistics category will receive software and other prizes from JMP!

Celebrating statisticians: George E.P. Box

In this International Year of Statistics, we at JMP are celebrating famous statisticians on a monthly basis. This month is my turn, and early this year I chose Professor George E.P. Box as the subject of my celebration. I was looking forward to writing this piece because I knew George personally and have been an admirer of his since the beginning of my career.

Sadly, George passed away in late March, and I wrote a remembrance of him for the JMP Blog at that time. That blog post expresses what I would have written in a post celebrating him. So, instead of speaking in general about his life and accomplishments, in this post I will focus on one of his many great papers. My plan is to write several such blog posts this month, each emphasizing a different one of his wonderful publications. One of the benefits for me is that I get to reread these papers.

In this post, I want to focus on the first part of his two-part paper with J. Stuart Hunter on the family of regular two-level fractional factorial designs, which was published in Technometrics in 1961.

This seminal paper is 40 pages long, and one thing I found notable about it was that the mathematical content did not go past arithmetic and a little algebra! Despite this, there are many fundamental results in this paper, but all are stated in natural language without formal proofs. That was refreshing.

How does the paper begin?

The paper starts with a brief exposition of two-level full factorial design in k factors. It shows how these designs can estimate interactions of all orders up to the k-factor interaction. This provides the motivation and background for introducing a half fraction of the full factorial design. They illustrate the construction method using the $2^{4-1}$ design, showing how one starts with the full factorial design in three factors and then adds a fourth factor by computing the elementwise product of the first three factors.
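The construction is short enough to write out. This Python sketch builds a half fraction by taking the full factorial in the first k - 1 factors and appending their elementwise product as the last factor. The function name `half_fraction` and the run order are illustrative choices, not taken from the paper.

```python
from itertools import product


def half_fraction(k):
    """Half fraction of a two-level design in k factors (last = product of the rest)."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for v in base:
            last *= v  # elementwise product of the first k - 1 factors
        runs.append(base + (last,))
    return runs


design = half_fraction(4)
for run in design:
    print(run)
```

The result has 8 runs, and each of the four columns is balanced between -1 and +1.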

Can you show this in JMP?

To reconstruct their example the way they did it by using JMP, we start by using the Full Factorial designer. We call our factors 1, 2 and 3 and our response Y. We compute the 4th column using the formula editor. The Y column in our Table 1 below has the same values as the ones they use in their Table 3.

Table 1: Formula column for factor 4 as the elementwise product of the first three.

Of course, we could also just use the Screening Designer in JMP to enter 4 factors. The design we want is the first in the list. :-)

What happens next?

They now have a design with 8 runs that is just half as many runs as are in the full factorial design with four factors. With the full factorial design, you can estimate 16 effects – the overall average, 4 main effects, 6 two-factor interactions, 4 three-factor interactions and 1 four-factor interaction (16 = 1 + 4 + 6 + 4 + 1). Now with 8 runs, you can only estimate 8 effects. It turns out that the construction the authors use confounds the 16 effects of the full factorial into 8 pairs of effects. The average is confounded with the four-factor interaction. The 4 main effects are each confounded with one of the 4 three-factor interactions. Finally, the 6 two-factor interactions are confounded in 3 pairs (8 = 1 + 4 + 3).
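The pairing of effects can be found mechanically: two effects are confounded when their contrast columns are identical over the runs of the design. The Python sketch below, with illustrative names, forms the contrast column for each of the 16 effects on the 8-run half fraction and groups the effects that share a column. `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations, product
from math import prod

# The 8-run half fraction: factor 4 is the product of factors 1, 2 and 3.
runs = [base + (prod(base),) for base in product((-1, 1), repeat=3)]


def effect_column(subset):
    """Contrast column for an effect, given the 0-based factor indices involved."""
    return tuple(prod(run[i] for i in subset) for run in runs)


# Group the 16 effects of the full factorial by their column on these 8 runs.
groups = {}
for size in range(5):
    for subset in combinations(range(4), size):
        label = "".join(str(i + 1) for i in subset) or "I"
        groups.setdefault(effect_column(subset), []).append(label)

alias_pairs = sorted(groups.values())
for pair in alias_pairs:
    print(" = ".join(pair))
```

The output lists 8 pairs: the average (labeled I) with 1234, each main effect with a three-factor interaction, and the two-factor interactions in 3 pairs.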

Figure 1 below shows the analysis from the JMP Screening platform. The values JMP reports are half of the quantities Box and Hunter report, because they define their effects as being the difference in the response when changing from one level of the factor to the other. JMP defines an effect as the change in the response due to a one-unit change in the factor. Since one level of the factor is coded -1 and the other is coded +1, each factor changes by two units going from its low to its high level. Thus, the effect of a one-unit change is half the effect of going from low to high.

Figure 1: JMP screening analysis showing aliases for the two-factor interactions

How does the rest of the paper go?

Of course, the paper is much too long for me to cover everything Box and Hunter introduce – especially not in this level of detail. Here are some of the big concepts:

  • Generalizing their 4-factor example, they show that the best way to create a half fraction of a k factor full factorial design is to start with a full factorial design in k – 1 factors and then calculate the last column by computing the elementwise product of the original k – 1 columns. They also show that one can reconstitute the full factorial by combining a half fraction with another half fraction in which every value is obtained by multiplying the corresponding value in the first fraction by –1. This leads to the concept of a foldover design – a term they also introduce here.
  • They introduce the idea of design resolution and define resolution III, IV and V designs. Introducing the idea of a saturated design, they describe resolution III designs of 7 factors in 8 runs, 15 factors in 16 runs and 31 factors in 32 runs. They also throw a bone to Plackett and Burman (1946) mentioning their constructions of 11 factors in 12 runs, 19 factors in 20 runs, 23 factors in 24 runs, etc.
  • They introduce the idea of design generators and use this idea to show how to block the fractional factorial designs in groups of runs that each have 2, 4, 8 or some other power of 2 runs per block.
  • They show how to obtain designs of resolution IV by folding over a design of resolution III and introduce the idea of design projectivity. For example, they state that every resolution IV design projects to a full factorial (or replicated full factorial) in any three of the factors. The benefit of this is that if only three factors turn out to be important, it is possible to estimate all the interaction effects of those three factors. And, it does not matter which three are important.
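The sign-flipping idea in the first bullet can be illustrated with the smallest case. The Python sketch below takes the half fraction of the three-factor design, with factor 3 set to the product of factors 1 and 2, multiplies every value by -1 and checks that the two halves together give the full factorial. The three-factor case is chosen deliberately: its defining word, 123, has odd length, so flipping all signs produces the complementary half. With an even number of factors, as in the four-factor example, flipping every sign reproduces the same fraction. The variable names are illustrative.

```python
from itertools import product

# Half fraction of the 2^3 design: factor 3 is the product of factors 1 and 2.
half = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

# Mirror image: every value multiplied by -1.
mirror = [tuple(-v for v in run) for run in half]

combined = set(half) | set(mirror)
full = set(product((-1, 1), repeat=3))

print(sorted(combined))
print(combined == full)  # True: the two halves make up the full factorial
```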

Where has design for screening gone in the 50+ years since then?

It is a tribute to the combined power and simplicity of this approach that the regular two-level fractional factorial designs are still in frequent use today. The construction and analysis of these designs does not require a computer, which made them popular when computers were rare. Of course, the calculations can be a bit tedious, so having a computer do them for you makes for fewer errors and more free time.

In the same year as the publication of this paper, Hall published 5 different orthogonal arrays for 15 factors in 16 runs. The saturated design in Box and Hunter’s paper was one of the 5. Hall's paper was also fundamental, as it turns out that all the 16-run orthogonal arrays for fewer factors are projections of the Hall arrays.

Forty years later, Sun, et al. (2002) catalogued all the orthogonal 16 run designs for 5 to 14 factors. For 9 to 14 factors, the 16 run designs of Box and Hunter are all of resolution III, which means that main effects are confounded with two-factor interactions. Sun, et al. found designs in these cases where none of the two-factor interactions confounds a main effect. Instead, some two-factor interactions may be correlated either plus or minus one-half with a main effect. The benefit of these designs is that main effects can be identified without the built-in ambiguity that resolution III designs entail.

References

Box, G. E. P. and Hunter, J. S. (1961) "The $2^{k-p}$ Fractional Factorial Designs Part I" Technometrics Vol. 3, No. 3, pp. 311-351.

Hall, M. Jr. (1961). Hadamard matrix of order 16. Jet Propulsion Laboratory Research Summary, 1, 21–26.

Sun, D. X., Li, W., and Ye, K. Q. (2002), “An Algorithm for Sequentially Constructing Non-Isomorphic Orthogonal Designs and Its Applications,” Technical Report SUNYSB-AMS-02-13, State University of New York at Stony Brook, Dept. of Applied Mathematics and Statistics.


Is your data too precise?

We usually want every measurement to be as precise as possible. In theory, that is good, but for the purposes of data analysis, more precision isn't always better.

It is usually best to examine any continuous variable and determine a reasonable precision for the recorded values. For instance, suppose I have a variable X, and X has the values shown in the table below:

The data is recorded to 10 decimal places. But if we are analyzing this data, do we really need that many digits to the right of the decimal? One way to examine this is to look at the plot of the estimates of the mean and standard deviation of this data, as a function of the level of rounding used. A good question to ask is "How sensitive are the summary statistics to the level of precision?"


A rule of thumb

The table and the chart show that we don't lose very much information in the estimate of the mean even if we round all the way to one decimal place. A good rule of thumb for the precision a variable needs is to divide the standard deviation of the unrounded data by 3 and round to the decimal place of the leading significant digit of the result. In the example shown here, the standard deviation of the unrounded data is 0.3612, so one-third of that is 0.12. The leading significant digit falls in the first decimal place, indicating that one decimal place is sufficient.
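
The rule of thumb is easy to write down in code. The Python sketch below is one possible reading of it; the function names are hypothetical, and it assumes that a result of 1 or more means no decimal places are needed.

```python
import math
import statistics


def decimals_from_sd(sd):
    """Decimal places suggested by the rule of thumb for a given standard deviation.

    Divide the standard deviation by 3 and locate the decimal place of the
    leading significant digit of the result.
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    third = sd / 3.0
    # floor(log10(x)) is the power of ten of the leading significant digit.
    return max(0, -math.floor(math.log10(third)))


def suggested_decimals(values):
    """Apply the rule to raw data using the sample standard deviation."""
    return decimals_from_sd(statistics.stdev(values))


# The example from the text: sd = 0.3612, and 0.3612 / 3 is about 0.12.
print(decimals_from_sd(0.3612))
```

A call such as `[round(v, suggested_decimals(x)) for v in x]` would then round a list `x` to the suggested precision.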

Does this matter when building models?

Recently, I encountered a predictive modeling problem where the time it took to fit a decision tree model was longer than desired (many hours of computational time). The data set being used was fairly large (several million rows). Investigating the problem, I discovered that many of the continuous predictor variables (the Xs, or the factors) were recorded out to many decimal places. After I applied the rounding rule of thumb, the computation time for fitting a decision tree model decreased to several minutes. Why did this happen? The recursive partitioning algorithm that builds decision trees considers a binary split at each unique level of each continuous predictor in order to find the optimum split. If the factors in the model are overly precise, this adds a large amount of computational overhead, often with very little benefit in improved model accuracy. Rounding the factors reduced the number of unique levels and made the model fitting algorithm perform much faster.
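
To get a feel for the size of the effect, the Python sketch below counts the candidate split points for one simulated predictor before and after rounding. The data are simulated (10,000 normal values with a standard deviation of 0.36, which is an assumption made only for this illustration), and `candidate_splits` is a hypothetical helper, not part of any tree-fitting library.

```python
import random


def candidate_splits(values):
    """Number of distinct binary split points for one continuous predictor.

    With u unique levels there are u - 1 places to cut between adjacent levels.
    """
    return max(len(set(values)) - 1, 0)


random.seed(0)
x = [random.gauss(5.0, 0.36) for _ in range(10_000)]
x_rounded = [round(v, 1) for v in x]  # one decimal place, as in the rule of thumb

print("unrounded:", candidate_splits(x))
print("rounded:  ", candidate_splits(x_rounded))
```

The unrounded predictor has nearly one candidate split per row, while the rounded one has only a few dozen, and that saving is repeated at every node and for every predictor.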

Binning Data

In practice, it is also known that even less precise indicators of the levels of continuous variables can be useful and lead to better models. One approach to reducing the precision of a predictor variable is to employ binning. Binning simply assigns each value of a continuous variable to a categorical level. Rounding is one example of binning. Another example of binning is to build a histogram of the data and use the bins in the histogram as the predictor. A previous blog post described an interactive binning tool that you can use to do this sort of binning manually. A third example of binning is to use a supervised approach, where the bins in the predictor are chosen in a way that maximizes the predictive ability of the binned variable.

For the data analysis problem I faced, with overly precise data leading to long computational time, I had 100 predictor variables that needed some level of binning applied. The interactive binning approach would have taken too long, and the supervised binning approach is, in itself, computationally intense, so it was taking quite a bit of time. I decided to employ an unsupervised binning approach that simply looks for groupings or clusters in the predictor variables, one variable at a time.

Binning Data Using Normal Mixture Distributions

Consider the slightly "lumpy" variable shown in the histogram below. A Normal Mixture distribution using 3 normal distributions is fit to the data (Hotspot>Continuous Fit>Normal Mixtures>Normal 3 Mixture). The parameter estimates from the mixture distribution are recorded, and a binning formula is created that assigns each row to the distribution in the mixture for which the row has the highest associated probability.

The binned variable is an integer that preserves the ordering in the data, so it can be used as a continuous, ordinal or nominal variable.
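
For readers who want to see the mechanics outside of JMP, here is a minimal Python sketch of the same idea. It is not the add-in's code or JMP's fitting algorithm: it uses a basic EM loop with a simple starting point, assumes there are at least as many rows as components, and all function names are hypothetical. Bins are numbered by ascending component mean, which preserves the ordering of the data when the components are reasonably well separated.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_normal_mixture(values, k=3, iterations=200):
    """Fit a k-component normal mixture to 1-D data with a basic EM loop.

    Returns a list of (weight, mean, sd) tuples sorted by ascending mean.
    """
    data = sorted(values)
    n = len(data)
    grand_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in data) / n)
    sd_floor = 1e-6 * max(overall_sd, 1.0)  # keeps a component from collapsing
    # Start each component on one contiguous slice of the sorted data.
    slices = [data[j * n // k:(j + 1) * n // k] for j in range(k)]
    means = [sum(s) / len(s) for s in slices]
    sds = [max(overall_sd / k, sd_floor)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: probability that each row belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            resp.append([d / total if total > 0 else 1.0 / k for d in dens])
        # M-step: update means, standard deviations and weights.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), sd_floor)
            weights[j] = nj / n

    return sorted(zip(weights, means, sds), key=lambda p: p[1])


def assign_bin(x, components):
    """1-based index of the component with the highest probability for x."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in components]
    return scores.index(max(scores)) + 1


# Simulated "lumpy" data: three groups centred at 0, 5 and 10 (illustration only).
random.seed(0)
lumpy = [random.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(200)]
components = fit_normal_mixture(lumpy, k=3)
binned = [assign_bin(x, components) for x in lumpy]
```

The list `binned` then holds an integer from 1 to 3 for each row and can stand in for the original variable in a model.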


To automate the process of using normal mixture distributions for unsupervised binning, I created a JMP add-in (available for download on the JMP File Exchange -- requires a free SAS profile).

As part of your data preparation for modeling, take into account the data precision of your predictor variables, and see if lower precision might be helpful as you do your analysis.  I hope the ideas shared in this post are useful to you as you work to build better models.


JMP attends the PSI Conference in Edinburgh

I had the pleasure of interviewing Richard Zink, Principal Research Statistician Developer in the JMP Life Sciences division, prior to his visit to the UK to speak at the PSI (Statisticians in the Pharmaceutical Industry) Conference in Glasgow on 14 May. His PSI talk is titled "Assessing the Similarity of Subjects Within a Study Site" and will discuss sampling approaches and describe how the availability of extensive computerized logic and validation checks early in the clinical trial not only ensures data quality, but can be used to identify potentially fraudulent activities. Richard has been instrumental in the development of JMP Clinical, especially for Pharmaco-Vigilance (PV), clinical trial fraud, patient narratives and Bayesian methods.

International Conference on Harmonisation (ICH) guidelines suggest that clinical trial data should be actively monitored to ensure data quality. What do you see as the limitations of and issues surrounding on-site monitoring of clinical trials?

Traditionally, this is a very manual process where monitors compare case report form (CRF) pages against the physician records. Not only is this time-consuming, but traveling to numerous clinical sites can be extremely expensive. There are also limitations in how the data can be reviewed. When working with paper, it would be extremely difficult to examine trends in variables across time or compare the results of multiple subjects. It is also not possible to compare results across investigator sites.

What is your view of risk-based monitoring of clinical trials?

I think people have interpreted the ICH guidelines very literally and have gotten in the habit of performing 100 percent source data verification (SDV) for all CRF fields. People may spend a lot of time reviewing fields that have little chance of error and that, at the end of the day, may have little impact on the findings of the clinical trial. In risk-based monitoring, we would take a random sample of available CRF pages and perform a thorough review of these sampled pages. Only if the error rate exceeded a certain threshold would more CRFs be sampled. Of course, it would be important to sample from all relevant CRF domains. Further, the sampling fraction may be based on the importance of the data to the study. For example, given their importance, 100 percent of all data for the primary endpoint and serious adverse events (SAEs) may be reviewed.

Given the drive to reduce the cost of clinical trials, how can statistical sampling be used to achieve data quality whilst minimizing cost of analyzing trials?

Hopefully, the sampling will be performed in such a way as to minimize the amount of on-site monitoring at the sites. This is beneficial in terms of travel costs, but also in terms of the number of person-hours spent manually reviewing the data. It is also extremely important to perform central monitoring of the data using a robust set of computational tools. This can include checks for outliers or implausible values, but may also include more complex analyses to examine trends across time, identify missing data, or identify noteworthy differences between the investigator sites.

What are the emerging trends in discovering fraud in clinical trials, and how do you see tools evolving to meet the requirements for examining data for quality and fraud?

I think fraud has always been an important issue. Fraud, compared to other issues regarding data quality, is different in that there is a deliberate intent to deceive. Other issues regarding data quality may be due to poor training or carelessness. If you have identified unusual values that point to a data quality problem, going the extra step to say that this is necessarily due to fraud is difficult. At the end of the day, whether the problem is due to fraud or carelessness, data quality issues need to be identified early so that appropriate remedies can be applied to minimize disruption to the trial and maintain trial integrity. In the last 10 to 15 years, there have been some publications describing statistical methods used to identify fraud in clinical trials. With numerous competing priorities, implementing these methods in practice may be difficult. Data standards certainly help to ensure that any software developed can apply to different study teams, therapeutic areas or companies. Interactive graphical methods are useful for getting as many team members involved as possible, with the ability to drill down to interesting cases. Of course, there are a lot of ways in which things can go wrong. Making these reviews efficient will be extremely important.

JMP Clinical covers so much more than just improving data quality and uncovering fraud. What are the key capabilities that you would highlight about the software?

I think the software has something for everybody, but what I find very satisfying about the software is its interactivity and the ability to review graphical results and statistical summaries side by side. In addition to the seven fraud detection tools, JMP Clinical has customizable patient profiles and adverse event narratives that allow for more straightforward clinical review and reporting. There is a snapshot comparison feature that allows the user to identify new or modified records as the study database is updated. There is a built-in notes feature that allows users to save and view notes at the analysis-, subject- or record-level. For the more analytically minded, we have a robust set of analyses for adverse events with adjustment using FDR and double FDR for incidence or time-to-event analyses, and a new feature that makes use of Bayesian hierarchical modeling. There is also an extensive set of predictive modeling tools and cross-validation features.
